[PATCH v2 0/6] Avoid cache trashing on clearing huge/gigantic page

2012-08-09 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, only clear the 4K area around the fault address
through the cache and use cache-avoiding clears for the rest of the 2MB
area.

This patchset implements a cache-avoiding version of clear_page for x86
only. If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
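
The resulting fault-time behaviour, simplified from patch 5/6 (the helper
name below is made up for illustration; the real logic lives in
clear_huge_page()/clear_gigantic_page()):

/* Clear only the faulting 4K subpage through the cache; use the
 * cache-avoiding clear for the rest of the huge page. */
static void clear_huge_page_around_fault(struct page *page,
                unsigned long haddr, unsigned long fault_address,
                unsigned int pages_per_huge_page)
{
        int target = (fault_address - haddr) >> PAGE_SHIFT;
        int i;

        for (i = 0; i < pages_per_huge_page; i++) {
                cond_resched();
                if (!ARCH_HAS_USER_NOCACHE || i == target)
                        clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
                else
                        clear_user_highpage_nocache(page + i,
                                        haddr + i * PAGE_SIZE);
        }
}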

v2:
  - No code change. Only commit messages are updated.
  - RFC mark is dropped.

Andi Kleen (6):
  THP: Use real address for NUMA policy
  mm: make clear_huge_page tolerate non aligned address
  THP: Pass real, not rounded, address to clear_huge_page
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 ++
 arch/x86/include/asm/string_64.h |5 ++
 arch/x86/lib/Makefile|1 +
 arch/x86/lib/clear_page_nocache_32.S |   30 +++
 arch/x86/lib/clear_page_nocache_64.S |   92 ++
 arch/x86/mm/fault.c  |7 +++
 mm/huge_memory.c |   17 +++---
 mm/memory.c  |   29 ++-
 9 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

-- 
1.7.7.6

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v2 5/6] mm: make clear_huge_page cache clear only around the fault address

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

Clearing a 2MB huge page will typically blow away several levels
of CPU caches. To avoid this, only clear the 4K area around the
fault address through the cache and use cache-avoiding clears
for the rest of the 2MB area.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c |   30 +++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b47199a..e9a75c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,35 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page)
 {
int i;
struct page *p = page;
+   unsigned long vaddr;
+   unsigned long haddr = addr & HPAGE_PMD_MASK;
+   int target = (addr - haddr) >> PAGE_SHIFT;
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page;
 i++, p = mem_map_next(p, page, i)) {
cond_resched();
-   clear_user_highpage(p, addr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE  || i == target)
+   clear_user_highpage(p, vaddr);
+   else
+   clear_user_highpage_nocache(p, vaddr);
}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4005,23 @@ void clear_huge_page(struct page *page,
 {
int i;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+   unsigned long vaddr;
+   int target = (addr - haddr) >> PAGE_SHIFT;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, haddr, pages_per_huge_page);
+   clear_gigantic_page(page, addr, pages_per_huge_page);
return;
}
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE || i == target)
+   clear_user_highpage(page + i, vaddr);
+   else
+   clear_user_highpage_nocache(page + i, vaddr);
}
 }
 
-- 
1.7.7.6



[PATCH v2 6/6] x86: switch the 64bit uncached page clear to SSE/AVX v2

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

With multiple threads vector stores are more efficient, so use them.
This causes the page clear to run non-preemptible and adds some
overhead. However, on 32bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On an NPB (NASA Parallel Benchmark) 128GB run on a Westmere this reduces
the performance regression of enabling transparent huge pages
by ~2% (from 2.81% to 0.81%), which is now close to the runtime
variability. A larger gain is expected on systems with AVX support.
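
For illustration only (this is not part of the patch), the store pattern of
the movntdq loop can be written in userspace C with SSE2 intrinsics; in the
kernel, the explicit %xmm0 save/restore plus kernel_fpu_begin()/
kernel_fpu_end() provide what the userspace runtime would otherwise
guarantee:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Zero one 4K page with cache-bypassing 128-bit stores (movntdq). */
static void clear_page_nocache_demo(void *page)
{
        __m128i zero = _mm_setzero_si128();
        char *p = page;         /* must be 16-byte aligned; a page is */
        int i;

        for (i = 0; i < 4096; i += 64) {
                _mm_stream_si128((__m128i *)(p + i +  0), zero);
                _mm_stream_si128((__m128i *)(p + i + 16), zero);
                _mm_stream_si128((__m128i *)(p + i + 32), zero);
                _mm_stream_si128((__m128i *)(p + i + 48), zero);
        }
        _mm_sfence();           /* order the non-temporal stores */
}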

Signed-off-by: Andi Kleen 
[kirill.shute...@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/lib/clear_page_nocache_64.S |   91 -
 1 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S
index ee16d15..c092919 100644
--- a/arch/x86/lib/clear_page_nocache_64.S
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -1,29 +1,92 @@
+/*
+ * Clear pages with cache bypass.
+ * 
+ * Copyright (C) 2011, 2012 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ */
+
 #include 
+#include 
+#include 
 #include 
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi page
  */
 ENTRY(clear_page_nocache)
CFI_STARTPROC
-   xorl   %eax,%eax
-   movl   $4096/64,%ecx
+   push   %rdi
+   call   kernel_fpu_begin
+   pop%rdi
+   sub$16,%rsp
+   CFI_ADJUST_CFA_OFFSET 16
+   movdqu %xmm0,(%rsp)
+   xorpd  %xmm0,%xmm0
+   movl   $4096/SSE_UNROLL,%ecx
.p2align 4
 .Lloop:
decl%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-   movnti %rax,(%rdi)
-   PUT(1)
-   PUT(2)
-   PUT(3)
-   PUT(4)
-   PUT(5)
-   PUT(6)
-   PUT(7)
-   leaq64(%rdi),%rdi
+   .set x,0
+   .rept SSE_UNROLL/16
+   movntdq %xmm0,x(%rdi)
+   .set x,x+16
+   .endr
+   leaqSSE_UNROLL(%rdi),%rdi
jnz .Lloop
-   nop
-   ret
+   movdqu (%rsp),%xmm0
+   addq   $16,%rsp
+   CFI_ADJUST_CFA_OFFSET -16
+   jmp   kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+   .section .altinstr_replacement,"ax"
+1: .byte 0xeb  /* jmp  */
+   .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+16, 2b-1b
+   .previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+   CFI_STARTPROC
+   push   %rdi
+   call   kernel_fpu_begin
+   pop%rdi
+   sub$32,%rsp
+   CFI_ADJUST_CFA_OFFSET 32
+   vmovdqu %ymm0,(%rsp)
+   vxorpd  %ymm0,%ymm0,%ymm0
+   movl   $4096/AVX_UNROLL,%ecx
+   .p2align 4
+.Lloop_avx:
+   decl%ecx
+   .set x,0
+   .rept AVX_UNROLL/32
+   vmovntdq %ymm0,x(%rdi)
+   .set x,x+32
+   .endr
+   leaqAVX_UNROLL(%rdi),%rdi
+   jnz .Lloop_avx
+   vmovdqu (%rsp),%ymm0
+   addq   $32,%rsp
+   CFI_ADJUST_CFA_OFFSET -32
+   jmp   kernel_fpu_end
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6



[PATCH v2 4/6] x86: Add clear_page_nocache

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

Add a cache-avoiding version of clear_page. It is a straightforward integer
variant of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures which use the virtual address for flushing
can hook into. This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of clear_page,
it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
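
A minimal sketch of the highmem glue, assuming kmap_atomic() as the mapping
primitive (illustrative only -- the exact arch/x86/mm/fault.c hunk may
differ):

#include <linux/highmem.h>      /* kmap_atomic()/kunmap_atomic() */

void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
{
        /* vaddr is not needed on x86: its caches do not alias */
        void *p = kmap_atomic(page);
        clear_page_nocache(p);
        kunmap_atomic(p);
}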

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/page.h  |2 ++
 arch/x86/include/asm/string_32.h |5 +
 arch/x86/include/asm/string_64.h |5 +
 arch/x86/lib/Makefile|1 +
 arch/x86/lib/clear_page_nocache_32.S |   30 ++
 arch/x86/lib/clear_page_nocache_64.S |   29 +
 arch/x86/mm/fault.c  |7 +++
 7 files changed, 79 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_nocache_32.S
 create mode 100644 arch/x86/lib/clear_page_nocache_64.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..a8ad6dd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_nocache_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
diff --git a/arch/x86/lib/clear_page_nocache_32.S b/arch/x86/lib/clear_page_nocache_32.S
new file mode 100644
index 000..2394e0c
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_32.S
@@ -0,0 +1,30 @@
+#include 
+#include 
+
+/*
+ * Zero a page avoiding the caches
+ * rdi page
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   mov%eax,%edi
+   xorl   %eax,%eax
+   movl   $4096/64,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
+   PUT(0)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+   lea 64(%edi),%edi
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/lib/clear_page_nocache_64.S b/arch/x86/lib/clear_page_nocache_64.S
new file mode 100644
index 000..ee16d15
--- /dev/null
+++ b/arch/x86/lib/clear_page_nocache_64.S
@@ -0,0 +1,29 @@
+#include 
+#include 
+
+/*
+ * Zero a page avoiding the caches
+ * rdi page
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   xorl   %eax,%eax
+   movl   $4096/64,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) movnti %rax,x*8(%rdi)
+   movnti %rax,(%rdi)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+   leaq64(%rdi),%rdi
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
diff --git a/arch/x86/mm

[PATCH v2 2/6] mm: make clear_huge_page tolerate non aligned address

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

hugetlb does not necessarily pass in an aligned address, so the
low-level address computation is wrong.

This fixes architectures that actually use the address for flushing
the cleared page (very few, like xtensa/sparc/...?)
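
For reference, the generic helper (roughly as in include/linux/highmem.h of
this era) hands the user virtual address down to the architecture, which is
why the computed address has to be correct:

static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
{
        void *addr = kmap_atomic(page);
        clear_user_page(addr, vaddr, page);     /* arch may flush based on vaddr */
        kunmap_atomic(addr);
}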

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..b47199a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,16 +3987,17 @@ void clear_huge_page(struct page *page,
 unsigned long addr, unsigned int pages_per_huge_page)
 {
int i;
+   unsigned long haddr = addr & HPAGE_PMD_MASK;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, addr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, pages_per_huge_page);
return;
}
 
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
}
 }
 
-- 
1.7.7.6



[PATCH v2 3/6] THP: Pass real, not rounded, address to clear_huge_page

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..ecd93f8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long haddr,
+   unsigned long address, pmd_t *pmd,
struct page *page)
 {
pgtable_t pgtable;
@@ -643,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (unlikely(!pgtable))
return VM_FAULT_OOM;
 
-   clear_huge_page(page, haddr, HPAGE_PMD_NR);
+   clear_huge_page(page, address, HPAGE_PMD_NR);
__SetPageUptodate(page);
 
spin_lock(&mm->page_table_lock);
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out;
}
-   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
+   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+   address, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
goto out;
-- 
1.7.7.6



[PATCH v2 1/6] THP: Use real address for NUMA policy

2012-08-09 Thread Kirill A. Shutemov
From: Andi Kleen 

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes. In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
  struct vm_area_struct *vma,
- unsigned long haddr, int nd,
+ unsigned long address, int nd,
  gfp_t extra_gfp)
 {
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-  HPAGE_PMD_ORDER, vma, haddr, nd);
+  HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
else
new_page = NULL;
 
-- 
1.7.7.6



Re: [PATCH v2 4/6] x86: Add clear_page_nocache

2012-08-13 Thread Kirill A. Shutemov
On Thu, Aug 09, 2012 at 04:22:04PM +0100, Jan Beulich wrote:
> >>> On 09.08.12 at 17:03, "Kirill A. Shutemov"  wrote:

...

> > ---
> >  arch/x86/include/asm/page.h  |2 ++
> >  arch/x86/include/asm/string_32.h |5 +
> >  arch/x86/include/asm/string_64.h |5 +
> >  arch/x86/lib/Makefile|1 +
> >  arch/x86/lib/clear_page_nocache_32.S |   30 ++
> >  arch/x86/lib/clear_page_nocache_64.S |   29 +
> 
> Couldn't this more reasonably go into clear_page_{32,64}.S?

We don't have clear_page_32.S.

> >+xorl   %eax,%eax
> >+movl   $4096/64,%ecx
> >+.p2align 4
> >+.Lloop:
> >+decl%ecx
> >+#define PUT(x) movnti %eax,x*8(%edi) ; movnti %eax,x*8+4(%edi)
> 
> Is doing twice as much unrolling as on 64-bit really worth it?

Moving 64 bytes per cycle is faster on Sandy Bridge, but slower on
Westmere. Any preference? ;)

Westmere:

 Performance counter stats for './test_unroll32' (20 runs):

      31498.420608 task-clock                #    0.998 CPUs utilized            ( +-  0.25% )
                40 context-switches          #    0.001 K/sec                    ( +-  1.40% )
                 0 CPU-migrations            #    0.000 K/sec                    ( +-100.00% )
                89 page-faults               #    0.003 K/sec                    ( +-  0.13% )
    74,728,231,935 cycles                    #    2.372 GHz                      ( +-  0.25% ) [83.34%]
    53,789,969,009 stalled-cycles-frontend   #   71.98% frontend cycles idle     ( +-  0.35% ) [83.33%]
    41,681,014,054 stalled-cycles-backend    #   55.78% backend  cycles idle     ( +-  0.43% ) [66.67%]
    37,992,733,278 instructions              #    0.51  insns per cycle
                                             #    1.42  stalled cycles per insn  ( +-  0.05% ) [83.33%]
     3,561,376,245 branches                  #  113.065 M/sec                    ( +-  0.05% ) [83.33%]
        27,182,795 branch-misses             #    0.76% of all branches          ( +-  0.06% ) [83.33%]

      31.558545812 seconds time elapsed                                          ( +-  0.25% )

 Performance counter stats for './test_unroll64' (20 runs):

      31564.753623 task-clock                #    0.998 CPUs utilized            ( +-  0.19% )
                39 context-switches          #    0.001 K/sec                    ( +-  0.40% )
                 0 CPU-migrations            #    0.000 K/sec
                90 page-faults               #    0.003 K/sec                    ( +-  0.12% )
    74,886,045,192 cycles                    #    2.372 GHz                      ( +-  0.19% ) [83.33%]
    57,477,323,995 stalled-cycles-frontend   #   76.75% frontend cycles idle     ( +-  0.26% ) [83.34%]
    44,548,142,150 stalled-cycles-backend    #   59.49% backend  cycles idle     ( +-  0.31% ) [66.67%]
    32,940,027,099 instructions              #    0.44  insns per cycle
                                             #    1.74  stalled cycles per insn  ( +-  0.05% ) [83.34%]
     1,884,944,093 branches                  #   59.717 M/sec                    ( +-  0.05% ) [83.32%]
         1,027,135 branch-misses             #    0.05% of all branches          ( +-  0.56% ) [83.34%]

      31.621001407 seconds time elapsed                                          ( +-  0.19% )

Sandy Bridge:

 Performance counter stats for './test_unroll32' (20 runs):

       8578.382891 task-clock                #    0.997 CPUs utilized            ( +-  0.08% )
                15 context-switches          #    0.000 M/sec                    ( +-  2.97% )
                 0 CPU-migrations            #    0.000 M/sec
                84 page-faults               #    0.000 M/sec                    ( +-  0.13% )
    29,154,476,597 cycles                    #    3.399 GHz                      ( +-  0.08% ) [83.33%]
    11,851,215,147 stalled-cycles-frontend   #   40.65% frontend cycles idle     ( +-  0.20% ) [83.33%]
     1,530,172,593 stalled-cycles-backend    #    5.25% backend  cycles idle     ( +-  1.44% ) [66.67%]
    37,915,778,094 instructions              #    1.30  insns per cycle
                                             #    0.31  stalled cycles per insn  ( +-  0.00% ) [83.34%]
     3,590,533,447 branches                  #  418.556 M/sec                    ( +-  0.01% ) [83.35%]
        26,500,765 branch-misses             #    0.74% of all branches          ( +-  0.01% ) [83.34%]

       8.604638449 seconds time elapsed                                          ( +-  0.08% )

 Performance counter stats for './test_unroll64' (20 runs):

       8463.789963 task-clock                #    0.997 CPUs utilized            ( +-  0.07% )

Re: [PATCH v2 4/6] x86: Add clear_page_nocache

2012-08-13 Thread Kirill A. Shutemov
On Mon, Aug 13, 2012 at 07:04:02PM +0200, Borislav Petkov wrote:
> On Mon, Aug 13, 2012 at 02:43:34PM +0300, Kirill A. Shutemov wrote:
> > $ cat test.c
> > #include 
> > #include 
> > 
> > #define SIZE 1024*1024*1024
> > 
> > void clear_page_nocache_sse2(void *page) __attribute__((regparm(1)));
> > 
> > int main(int argc, char** argv)
> > {
> > char *p;
> > unsigned long i, j;
> > 
> > p = mmap(NULL, SIZE, PROT_WRITE|PROT_READ,
> > MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
> > for(j = 0; j < 100; j++) {
> > for(i = 0; i < SIZE; i += 4096) {
> > clear_page_nocache_sse2(p + i);
> > }
> > }
> > 
> > return 0;
> > }
> > $ cat clear_page_nocache_unroll32.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> > mov%eax,%edx
> > xorl   %eax,%eax
> > movl   $4096/32,%ecx
> > .p2align 4
> > .Lloop_sse2:
> > decl%ecx
> > #define PUT(x) movnti %eax,x*4(%edx)
> > PUT(0)
> > PUT(1)
> > PUT(2)
> > PUT(3)
> > PUT(4)
> > PUT(5)
> > PUT(6)
> > PUT(7)
> > #undef PUT
> > lea 32(%edx),%edx
> > jnz .Lloop_sse2
> > nop
> > ret
> > .cfi_endproc
> > .type clear_page_nocache_sse2, @function
> > .size clear_page_nocache_sse2, .-clear_page_nocache_sse2
> > $ cat clear_page_nocache_unroll64.S
> > .globl clear_page_nocache_sse2
> > .align 4,0x90
> > clear_page_nocache_sse2:
> > .cfi_startproc
> > mov%eax,%edx
> 
> This must still be the 32-bit version because it segfaults here.

Yes, it's a test for the 32-bit version.

-- 
 Kirill A. Shutemov


[PATCH v3 1/7] THP: Use real address for NUMA policy

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes. In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
  struct vm_area_struct *vma,
- unsigned long haddr, int nd,
+ unsigned long address, int nd,
  gfp_t extra_gfp)
 {
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-  HPAGE_PMD_ORDER, vma, haddr, nd);
+  HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
else
new_page = NULL;
 
-- 
1.7.7.6



[PATCH v3 7/7] x86: switch the 64bit uncached page clear to SSE/AVX v2

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

With multiple threads vector stores are more efficient, so use them.
This causes the page clear to run non-preemptible and adds some
overhead. However, on 32bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On an NPB (NASA Parallel Benchmark) 128GB run on a Westmere this reduces
the performance regression of enabling transparent huge pages
by ~2% (from 2.81% to 0.81%), which is now close to the runtime
variability. A larger gain is expected on systems with AVX support.

Signed-off-by: Andi Kleen 
[kirill.shute...@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/lib/clear_page_64.S |   79 ++
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 9d2f3c2..b302cff 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -73,30 +73,79 @@ ENDPROC(clear_page)
 .Lclear_page_end-clear_page,3b-2b
.previous
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi page
  */
 ENTRY(clear_page_nocache)
CFI_STARTPROC
-   xorl   %eax,%eax
-   movl   $4096/64,%ecx
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$16,%rsp
+   CFI_ADJUST_CFA_OFFSET 16
+   movdqu %xmm0,(%rsp)
+   xorpd  %xmm0,%xmm0
+   movl   $4096/SSE_UNROLL,%ecx
.p2align 4
 .Lloop_nocache:
decl%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-   movnti %rax,(%rdi)
-   PUT(1)
-   PUT(2)
-   PUT(3)
-   PUT(4)
-   PUT(5)
-   PUT(6)
-   PUT(7)
-#undef PUT
-   leaq64(%rdi),%rdi
+   .set x,0
+   .rept SSE_UNROLL/16
+   movntdq %xmm0,x(%rdi)
+   .set x,x+16
+   .endr
+   leaqSSE_UNROLL(%rdi),%rdi
jnz .Lloop_nocache
-   nop
-   ret
+   movdqu (%rsp),%xmm0
+   addq   $16,%rsp
+   CFI_ADJUST_CFA_OFFSET -16
+   jmp   kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+   .section .altinstr_replacement,"ax"
+1: .byte 0xeb  /* jmp  */
+   .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+16, 2b-1b
+   .previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+   CFI_STARTPROC
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$32,%rsp
+   CFI_ADJUST_CFA_OFFSET 32
+   vmovdqu %ymm0,(%rsp)
+   vxorpd  %ymm0,%ymm0,%ymm0
+   movl   $4096/AVX_UNROLL,%ecx
+   .p2align 4
+.Lloop_avx:
+   decl%ecx
+   .set x,0
+   .rept AVX_UNROLL/32
+   vmovntdq %ymm0,x(%rdi)
+   .set x,x+32
+   .endr
+   leaqAVX_UNROLL(%rdi),%rdi
+   jnz .Lloop_avx
+   vmovdqu (%rsp),%ymm0
+   addq   $32,%rsp
+   CFI_ADJUST_CFA_OFFSET -32
+   jmp   kernel_fpu_end
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6



[PATCH v3 4/7] mm: pass fault address to clear_huge_page()

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mm.h |2 +-
 mm/huge_memory.c   |2 +-
 mm/hugetlb.c   |3 ++-
 mm/memory.c|7 ---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..2858723 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1638,7 +1638,7 @@ extern void dump_page(struct page *page);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-   unsigned long addr,
+   unsigned long haddr, unsigned long fault_address,
unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f0825b611..070bf89 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (unlikely(!pgtable))
return VM_FAULT_OOM;
 
-   clear_huge_page(page, haddr, HPAGE_PMD_NR);
+   clear_huge_page(page, haddr, address, HPAGE_PMD_NR);
__SetPageUptodate(page);
 
spin_lock(&mm->page_table_lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c86d3d..5182192 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2718,7 +2718,8 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, haddr, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, fault_address,
+   pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dfc179b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3984,19 +3984,20 @@ static void clear_gigantic_page(struct page *page,
}
 }
 void clear_huge_page(struct page *page,
-unsigned long addr, unsigned int pages_per_huge_page)
+unsigned long haddr, unsigned long fault_address,
+unsigned int pages_per_huge_page)
 {
int i;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, addr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, pages_per_huge_page);
return;
}
 
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
}
 }
 
-- 
1.7.7.6



[PATCH v3 0/7] Avoid cache trashing on clearing huge/gigantic page

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, only clear the 4K area around the fault address
through the cache and use cache-avoiding clears for the rest of the 2MB
area.

This patchset implements a cache-avoiding version of clear_page for x86
only. If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

v3:
  - Rebased to current Linus' tree. The kmap_atomic() build issue is fixed;
  - Pass the fault address to clear_huge_page(). v2 had a problem with
    clearing for sizes other than HPAGE_SIZE;
  - x86: fix the 32bit variant. A fallback version of clear_page_nocache()
    has been added for non-SSE2 systems;
  - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
  - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
  - No code change. Only commit messages are updated.
  - RFC mark is dropped.

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (2):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()

 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 ++
 arch/x86/include/asm/string_64.h |5 ++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 +++
 arch/x86/lib/clear_page_64.S |   78 ++
 arch/x86/mm/fault.c  |7 +++
 include/linux/mm.h   |2 +-
 mm/huge_memory.c |   17 
 mm/hugetlb.c |   39 ++-
 mm/memory.c  |   37 +++---
 11 files changed, 232 insertions(+), 35 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6



[PATCH v3 2/7] THP: Pass fault address to __do_huge_pmd_anonymous_page()

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..6f0825b611 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long haddr,
+   unsigned long address, pmd_t *pmd,
struct page *page)
 {
pgtable_t pgtable;
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out;
}
-   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
+   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+   address, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
goto out;
-- 
1.7.7.6



[PATCH v3 3/7] hugetlb: pass fault address to hugetlb_no_page()

2012-08-16 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 mm/hugetlb.c |   38 +++---
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..3c86d3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2672,7 +2672,8 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep, unsigned int flags)
+   unsigned long haddr, unsigned long fault_address,
+   pte_t *ptep, unsigned int flags)
 {
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
@@ -2696,7 +2697,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
 
mapping = vma->vm_file->f_mapping;
-   idx = vma_hugecache_offset(h, vma, address);
+   idx = vma_hugecache_offset(h, vma, haddr);
 
/*
 * Use page lock to guard against racing truncation
@@ -2708,7 +2709,7 @@ retry:
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
-   page = alloc_huge_page(vma, address, 0);
+   page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
if (ret == -ENOMEM)
@@ -2717,7 +2718,7 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, address, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
@@ -2763,7 +2764,7 @@ retry:
 * the spinlock.
 */
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
-   if (vma_needs_reservation(h, vma, address) < 0) {
+   if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
@@ -2778,16 +2779,16 @@ retry:
goto backout;
 
if (anon_rmap)
-   hugepage_add_new_anon_rmap(page, vma, address);
+   hugepage_add_new_anon_rmap(page, vma, haddr);
else
page_dup_rmap(page);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
-   set_huge_pte_at(mm, address, ptep, new_pte);
+   set_huge_pte_at(mm, haddr, ptep, new_pte);
 
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
-   ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+   ret = hugetlb_cow(mm, vma, haddr, ptep, new_pte, page);
}
 
spin_unlock(&mm->page_table_lock);
@@ -2813,21 +2814,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *pagecache_page = NULL;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);
+   unsigned long haddr = address & huge_page_mask(h);
 
-   address &= huge_page_mask(h);
-
-   ptep = huge_pte_offset(mm, address);
+   ptep = huge_pte_offset(mm, haddr);
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
-   migration_entry_wait(mm, (pmd_t *)ptep, address);
+   migration_entry_wait(mm, (pmd_t *)ptep, haddr);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
}
 
-   ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+   ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
 
@@ -2839,7 +2839,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
-   ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+   ret = hugetlb_no_page(mm, vma, haddr, address, ptep, flags);
goto out_mutex;
}
 
@@ -2854,14 +2854,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 * consumed.
 */
if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) {
-   if (vma_needs_reservatio

[PATCH v3 5/7] x86: Add clear_page_nocache

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Add a cache-avoiding version of clear_page. It is a straightforward integer
variant of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures which use the virtual address for flushing
can hook into. This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of clear_page,
it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 +++
 arch/x86/include/asm/string_64.h |5 +++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 ++
 arch/x86/lib/clear_page_64.S |   29 +++
 arch/x86/mm/fault.c  |7 
 7 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..14e47a2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
@@ -40,7 +41,7 @@ endif
 else
 obj-y += iomap_copy_64.o
 lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-lib-y += thunk_64.o clear_page_64.o copy_page_64.o
+lib-y += thunk_64.o copy_page_64.o
 lib-y += memmove_64.o memset_64.o
 lib-y += copy_user_64.o copy_user_nocache_64.o
lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_page_32.S b/arch/x86/lib/clear_page_32.S
new file mode 100644
index 000..9592161
--- /dev/null
+++ b/arch/x86/lib/clear_page_32.S
@@ -0,0 +1,72 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Fallback version if SSE2 is not avaible.
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   mov%eax,%edx
+   xorl   %eax,%eax
+   movl   $4096/32,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) mov %eax,x*4(%edx)
+   PUT(0)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+#undef PUT
+   lea 32(%edx),%edx
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
+
+   .section .altinstr_replacement,"ax"
+1:  .byte 0xeb /* jmp  */
+   .byte (clear_page_nocache_sse2 - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_XMM2,\
+   16, 2b-1b
+   .previous
+
+/*
+ * Zero a page avoiding the caches
+ * eax page
+ */
+ENTRY(clear_page_nocache_sse2)
+   CFI_STARTPROC
+   mov%eax,%edx
+   

[PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
From: Andi Kleen 

Clearing a 2MB huge page will typically blow away several levels
of CPU caches. To avoid this, only clear the 4K area around the
fault address through the cache and use cache-avoiding clears
for the rest of the 2MB area.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c |   34 +-
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfc179b..d4626b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,34 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
-   unsigned long addr,
-   unsigned int pages_per_huge_page)
+   unsigned long haddr, unsigned long fault_address,
+   unsigned int pages_per_huge_page)
 {
int i;
struct page *p = page;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page;
 i++, p = mem_map_next(p, page, i)) {
cond_resched();
-   clear_user_highpage(p, addr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE  || i == target)
+   clear_user_highpage(p, vaddr);
+   else
+   clear_user_highpage_nocache(p, vaddr);
}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4004,24 @@ void clear_huge_page(struct page *page,
 unsigned int pages_per_huge_page)
 {
int i;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, haddr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, fault_address,
+   pages_per_huge_page);
return;
}
 
might_sleep();
+   vaddr = haddr;
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+   vaddr = haddr + i*PAGE_SIZE;
+   if (!ARCH_HAS_USER_NOCACHE || i == target)
+   clear_user_highpage(page + i, vaddr);
+   else
+   clear_user_highpage_nocache(page + i, vaddr);
}
 }
 
-- 
1.7.7.6



Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
On Thu, Aug 16, 2012 at 06:16:47PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
> > for (i = 0; i < pages_per_huge_page;
> >  i++, p = mem_map_next(p, page, i)) {
> 
> It may be more optimal to avoid a multiplication/shiftleft before the
> add, and to do:
> 
>   for (i = 0, vaddr = haddr; i < pages_per_huge_page;
>i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
> 

Makes sense. I'll update it.

> > cond_resched();
> > -   clear_user_highpage(p, addr + i * PAGE_SIZE);
> > +   vaddr = haddr + i*PAGE_SIZE;
> 
> Not sure if gcc can optimize it away because of the external calls.
> 
> > +   if (!ARCH_HAS_USER_NOCACHE || i == target)
> > +   clear_user_highpage(page + i, vaddr);
> > +   else
> > +   clear_user_highpage_nocache(page + i, vaddr);
> > }
> 
> 
> My only worry overall is if there can be some workload where this may
> actually slow down userland if the CPU cache is very large and
> userland would access most of the faulted in memory after the first
> fault.
> 
> So I wouldn't mind to add one more check in addition of
> !ARCH_HAS_USER_NOCACHE above to check a runtime sysctl variable. It'll
> waste a cacheline yes but I doubt it's measurable compared to the time
> it takes to do a >=2M hugepage copy.

Hm.. I think with a static_key we can avoid the cache overhead here. I'll try.
 
> Furthermore it would allow people to benchmark its effect without
> having to rebuild the kernel themself.
> 
> All other patches looks fine to me.

Thanks for the review. Could you take a look at the huge zero page patchset? ;)

-- 
 Kirill A. Shutemov


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Kirill A. Shutemov
On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> > Hm.. I think with static_key we can avoid cache overhead here. I'll try.
> 
> Could you elaborate on the static_key? Is it some sort of self
> modifying code?

Runtime code patching. See Documentation/static-keys.txt. We can patch it
when the sysctl is changed.
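
A rough sketch of the idea (the names are made up; the real patch may look
different):

#include <linux/jump_label.h>

/* Default-on knob; the sysctl handler would call static_key_slow_inc()/
 * static_key_slow_dec() to flip it at runtime. */
static struct static_key clear_huge_page_nocache_key = STATIC_KEY_INIT_TRUE;

static void clear_one_subpage(struct page *p, unsigned long vaddr, bool is_target)
{
        /* The branch is patched in place: no extra memory load on the fast
         * path, just a jmp/nop rewritten by the jump label machinery. */
        if (static_key_true(&clear_huge_page_nocache_key) && !is_target)
                clear_user_highpage_nocache(p, vaddr);
        else
                clear_user_highpage(p, vaddr);
}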

> 
> > Thanks, for review. Could you take a look at huge zero page patchset? ;)
> 
> I've noticed that too, nice :). I'm checking some detail on the
> wrprotect fault behavior but I'll comment there.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
 Kirill A. Shutemov


[PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-08-20 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, only clear the 4K area around the fault address
through the cache and use cache-avoiding clears for the rest of the 2MB
area.

This patchset implements a cache-avoiding version of clear_page for x86
only. If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

v4:
  - vm.clear_huge_page_nocache sysctl;
  - rework page iteration in clear_{huge,gigantic}_page according to
    Andrea Arcangeli's suggestion;
v3:
  - Rebased to current Linus' tree. The kmap_atomic() build issue is fixed;
  - Pass the fault address to clear_huge_page(). v2 had a problem with
    clearing for sizes other than HPAGE_SIZE;
  - x86: fix the 32bit variant. A fallback version of clear_page_nocache()
    has been added for non-SSE2 systems;
  - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
  - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
  - No code change. Only commit messages are updated;
  - RFC mark is dropped;

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (3):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()
  mm: implement vm.clear_huge_page_nocache sysctl

 Documentation/sysctl/vm.txt  |   13 ++
 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 ++
 arch/x86/include/asm/string_64.h |5 ++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 +++
 arch/x86/lib/clear_page_64.S |   78 ++
 arch/x86/mm/fault.c  |7 +++
 include/linux/mm.h   |7 +++-
 kernel/sysctl.c  |   12 ++
 mm/huge_memory.c |   17 
 mm/hugetlb.c |   39 ++-
 mm/memory.c  |   72 ++
 13 files changed, 294 insertions(+), 38 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6



[PATCH v4 1/8] THP: Use real address for NUMA policy

2012-08-20 Thread Kirill A. Shutemov
From: Andi Kleen 

Use the fault address, not the rounded-down hpage address, for NUMA
policy purposes. In some circumstances this can give a more exact
NUMA policy.

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
  struct vm_area_struct *vma,
- unsigned long haddr, int nd,
+ unsigned long address, int nd,
  gfp_t extra_gfp)
 {
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-  HPAGE_PMD_ORDER, vma, haddr, nd);
+  HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, address, numa_node_id(), 0);
else
new_page = NULL;
 
-- 
1.7.7.6



[PATCH v4 3/8] hugetlb: pass fault address to hugetlb_no_page()

2012-08-20 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 mm/hugetlb.c |   38 +++---
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..3c86d3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2672,7 +2672,8 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep, unsigned int flags)
+   unsigned long haddr, unsigned long fault_address,
+   pte_t *ptep, unsigned int flags)
 {
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
@@ -2696,7 +2697,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
 
mapping = vma->vm_file->f_mapping;
-   idx = vma_hugecache_offset(h, vma, address);
+   idx = vma_hugecache_offset(h, vma, haddr);
 
/*
 * Use page lock to guard against racing truncation
@@ -2708,7 +2709,7 @@ retry:
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
-   page = alloc_huge_page(vma, address, 0);
+   page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
if (ret == -ENOMEM)
@@ -2717,7 +2718,7 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, address, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
@@ -2763,7 +2764,7 @@ retry:
 * the spinlock.
 */
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
-   if (vma_needs_reservation(h, vma, address) < 0) {
+   if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
@@ -2778,16 +2779,16 @@ retry:
goto backout;
 
if (anon_rmap)
-   hugepage_add_new_anon_rmap(page, vma, address);
+   hugepage_add_new_anon_rmap(page, vma, haddr);
else
page_dup_rmap(page);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
-   set_huge_pte_at(mm, address, ptep, new_pte);
+   set_huge_pte_at(mm, haddr, ptep, new_pte);
 
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
-   ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+   ret = hugetlb_cow(mm, vma, haddr, ptep, new_pte, page);
}
 
spin_unlock(&mm->page_table_lock);
@@ -2813,21 +2814,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *pagecache_page = NULL;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);
+   unsigned long haddr = address & huge_page_mask(h);
 
-   address &= huge_page_mask(h);
-
-   ptep = huge_pte_offset(mm, address);
+   ptep = huge_pte_offset(mm, haddr);
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
-   migration_entry_wait(mm, (pmd_t *)ptep, address);
+   migration_entry_wait(mm, (pmd_t *)ptep, haddr);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
}
 
-   ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+   ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
 
@@ -2839,7 +2839,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
-   ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+   ret = hugetlb_no_page(mm, vma, haddr, address, ptep, flags);
goto out_mutex;
}
 
@@ -2854,14 +2854,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 * consumed.
 */
if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) {
-   if (vma_needs_reservatio

[PATCH v4 4/8] mm: pass fault address to clear_huge_page()

2012-08-20 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mm.h |2 +-
 mm/huge_memory.c   |2 +-
 mm/hugetlb.c   |3 ++-
 mm/memory.c|7 ---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..2858723 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1638,7 +1638,7 @@ extern void dump_page(struct page *page);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-   unsigned long addr,
+   unsigned long haddr, unsigned long fault_address,
unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f0825b611..070bf89 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (unlikely(!pgtable))
return VM_FAULT_OOM;
 
-   clear_huge_page(page, haddr, HPAGE_PMD_NR);
+   clear_huge_page(page, haddr, address, HPAGE_PMD_NR);
__SetPageUptodate(page);
 
spin_lock(&mm->page_table_lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c86d3d..5182192 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2718,7 +2718,8 @@ retry:
ret = VM_FAULT_SIGBUS;
goto out;
}
-   clear_huge_page(page, haddr, pages_per_huge_page(h));
+   clear_huge_page(page, haddr, fault_address,
+   pages_per_huge_page(h));
__SetPageUptodate(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dfc179b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3984,19 +3984,20 @@ static void clear_gigantic_page(struct page *page,
}
 }
 void clear_huge_page(struct page *page,
-unsigned long addr, unsigned int pages_per_huge_page)
+unsigned long haddr, unsigned long fault_address,
+unsigned int pages_per_huge_page)
 {
int i;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, addr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, pages_per_huge_page);
return;
}
 
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
cond_resched();
-   clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
}
 }
 
-- 
1.7.7.6



[PATCH v4 2/8] THP: Pass fault address to __do_huge_pmd_anonymous_page()

2012-08-20 Thread Kirill A. Shutemov
From: Andi Kleen 

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..6f0825b611 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct 
vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long haddr,
+   unsigned long address, pmd_t *pmd,
struct page *page)
 {
pgtable_t pgtable;
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
put_page(page);
goto out;
}
-   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
+   if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+   address, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
goto out;
-- 
1.7.7.6



[PATCH v4 6/8] mm: make clear_huge_page cache clear only around the fault address

2012-08-20 Thread Kirill A. Shutemov
From: Andi Kleen 

Clearing a 2MB huge page will typically blow away several levels
of CPU caches. To avoid this, only cache clear the 4K area
around the fault address and use cache-avoiding clears
for the rest of the 2MB area.
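
To make the fault-address handling concrete, the per-subpage decision in the
patch below boils down to this arithmetic (a minimal userspace sketch; the
fault address is made up for illustration):

#include <stdio.h>

#define PAGE_SHIFT	12
#define HPAGE_PMD_SIZE	(1UL << 21)		/* 2MB huge page */
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	unsigned long fault_address = 0x7f0000205000UL;	/* hypothetical */
	unsigned long haddr = fault_address & HPAGE_PMD_MASK;
	int target = (fault_address - haddr) >> PAGE_SHIFT;

	/* prints 5: only subpage 5 of the 512 is cleared through the
	 * cache, the remaining 511 take the cache-avoiding path */
	printf("target subpage = %d\n", target);
	return 0;
}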

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c |   37 +
 1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfc179b..625ca33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,32 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
-   unsigned long addr,
-   unsigned int pages_per_huge_page)
+   unsigned long haddr, unsigned long fault_address,
+   unsigned int pages_per_huge_page)
 {
int i;
struct page *p = page;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
might_sleep();
-   for (i = 0; i < pages_per_huge_page;
-i++, p = mem_map_next(p, page, i)) {
+   for (i = 0, vaddr = haddr; i < pages_per_huge_page;
+   i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
cond_resched();
-   clear_user_highpage(p, addr + i * PAGE_SIZE);
+   if (!ARCH_HAS_USER_NOCACHE  || i == target)
+   clear_user_highpage(p, vaddr);
+   else
+   clear_user_highpage_nocache(p, vaddr);
}
 }
 void clear_huge_page(struct page *page,
@@ -3988,16 +4002,23 @@ void clear_huge_page(struct page *page,
 unsigned int pages_per_huge_page)
 {
int i;
+   unsigned long vaddr;
+   int target = (fault_address - haddr) >> PAGE_SHIFT;
 
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-   clear_gigantic_page(page, haddr, pages_per_huge_page);
+   clear_gigantic_page(page, haddr, fault_address,
+   pages_per_huge_page);
return;
}
 
might_sleep();
-   for (i = 0; i < pages_per_huge_page; i++) {
+   for (i = 0, vaddr = haddr; i < pages_per_huge_page;
+   i++, page++, vaddr += PAGE_SIZE) {
cond_resched();
-   clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+   if (!ARCH_HAS_USER_NOCACHE || i == target)
+   clear_user_highpage(page, vaddr);
+   else
+   clear_user_highpage_nocache(page, vaddr);
}
 }
 
-- 
1.7.7.6



[PATCH v4 7/8] x86: switch the 64bit uncached page clear to SSE/AVX v2

2012-08-20 Thread Kirill A. Shutemov
From: Andi Kleen 

With multiple threads vector stores are more efficient, so use them.
This will cause the page clear to run non-preemptibly and add some
overhead. However, on 32bit it was already non-preemptible (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On an NPB (NAS Parallel Benchmarks) 128GB run on a Westmere this improves
the performance regression of enabling transparent huge pages
by ~2% (from 2.81% to 0.81%), which is now close to the run-to-run variability.
On a system with AVX support more is expected.
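
For reference, the movntdq loop added below behaves roughly like the following
userspace sketch written with SSE2 intrinsics (illustrative only, not part of
the patch; it assumes a 16-byte-aligned buffer):

#include <emmintrin.h>	/* SSE2: _mm_stream_si128() and friends */
#include <stddef.h>

#define PAGE_SIZE 4096

static void clear_page_nocache_sketch(void *page)
{
	__m128i zero = _mm_setzero_si128();
	char *p = page;
	size_t off;

	/* non-temporal stores bypass the cache hierarchy, so clearing
	 * the page does not evict the caller's working set */
	for (off = 0; off < PAGE_SIZE; off += 64) {
		_mm_stream_si128((__m128i *)(p + off), zero);
		_mm_stream_si128((__m128i *)(p + off + 16), zero);
		_mm_stream_si128((__m128i *)(p + off + 32), zero);
		_mm_stream_si128((__m128i *)(p + off + 48), zero);
	}
	_mm_sfence();	/* order streaming stores before later accesses */
}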

Signed-off-by: Andi Kleen 
[kirill.shute...@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/lib/clear_page_64.S |   79 ++
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 9d2f3c2..b302cff 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -73,30 +73,79 @@ ENDPROC(clear_page)
 .Lclear_page_end-clear_page,3b-2b
.previous
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi page
  */
 ENTRY(clear_page_nocache)
CFI_STARTPROC
-   xorl   %eax,%eax
-   movl   $4096/64,%ecx
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$16,%rsp
+   CFI_ADJUST_CFA_OFFSET 16
+   movdqu %xmm0,(%rsp)
+   xorpd  %xmm0,%xmm0
+   movl   $4096/SSE_UNROLL,%ecx
.p2align 4
 .Lloop_nocache:
decl%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-   movnti %rax,(%rdi)
-   PUT(1)
-   PUT(2)
-   PUT(3)
-   PUT(4)
-   PUT(5)
-   PUT(6)
-   PUT(7)
-#undef PUT
-   leaq64(%rdi),%rdi
+   .set x,0
+   .rept SSE_UNROLL/16
+   movntdq %xmm0,x(%rdi)
+   .set x,x+16
+   .endr
+   leaqSSE_UNROLL(%rdi),%rdi
jnz .Lloop_nocache
-   nop
-   ret
+   movdqu (%rsp),%xmm0
+   addq   $16,%rsp
+   CFI_ADJUST_CFA_OFFSET -16
+   jmp   kernel_fpu_end
CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+   .section .altinstr_replacement,"ax"
+1: .byte 0xeb  /* jmp  */
+   .byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+16, 2b-1b
+   .previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+   CFI_STARTPROC
+   pushq_cfi %rdi
+   call   kernel_fpu_begin
+   popq_cfi  %rdi
+   sub$32,%rsp
+   CFI_ADJUST_CFA_OFFSET 32
+   vmovdqu %ymm0,(%rsp)
+   vxorpd  %ymm0,%ymm0,%ymm0
+   movl   $4096/AVX_UNROLL,%ecx
+   .p2align 4
+.Lloop_avx:
+   decl%ecx
+   .set x,0
+   .rept AVX_UNROLL/32
+   vmovntdq %ymm0,x(%rdi)
+   .set x,x+32
+   .endr
+   leaqAVX_UNROLL(%rdi),%rdi
+   jnz .Lloop_avx
+   vmovdqu (%rsp),%ymm0
+   addq   $32,%rsp
+   CFI_ADJUST_CFA_OFFSET -32
+   jmp   kernel_fpu_end
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6



[PATCH v4 8/8] mm: implement vm.clear_huge_page_nocache sysctl

2012-08-20 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

In some cases cache-avoiding clearing of huge pages may slow down a workload.
Let's provide a sysctl handle to disable it.

We use a static_key here to avoid extra work on the fast path.
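
With that, the per-subpage loop from the earlier patch ends up looking roughly
like this sketch, where is_nocache_enabled() is the static_key test added
below and costs a patched jump rather than a memory load on the fast path:

	if (!is_nocache_enabled() || i == target)
		clear_user_highpage(p, vaddr);		/* cached clear */
	else
		clear_user_highpage_nocache(p, vaddr);	/* cache-avoiding */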

Signed-off-by: Kirill A. Shutemov 
---
 Documentation/sysctl/vm.txt |   13 
 include/linux/mm.h  |5 
 kernel/sysctl.c |   12 +++
 mm/memory.c |   44 +-
 4 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..9559a97 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- clear_huge_page_nocache
 - compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
@@ -74,6 +75,18 @@ huge pages although processes will also directly compact 
memory as required.
 
 ==
 
+clear_huge_page_nocache
+
+Available only when the architecture provides ARCH_HAS_USER_NOCACHE and
+CONFIG_TRANSPARENT_HUGEPAGE or CONFIG_HUGETLBFS is set.
+
+When set to 1 (default) the kernel will use a cache-avoiding clear routine for
+clearing huge pages. This minimizes cache pollution.
+When set to 0 the kernel will clear huge pages through the cache. This may speed
+up some workloads. It is also useful for benchmarking purposes.
+
+==
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the background kernel
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2858723..9b48f43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1643,6 +1643,11 @@ extern void clear_huge_page(struct page *page,
 extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
+#ifdef ARCH_HAS_USER_NOCACHE
+extern int sysctl_clear_huge_page_nocache;
+extern int clear_huge_page_nocache_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *length, loff_t *ppos);
+#endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..80ccc67 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1366,6 +1366,18 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
 #endif
+#if defined(ARCH_HAS_USER_NOCACHE) && \
+   (defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS))
+   {
+   .procname   = "clear_huge_page_nocache",
+   .data   = &sysctl_clear_huge_page_nocache,
+   .maxlen = sizeof(sysctl_clear_huge_page_nocache),
+   .mode   = 0644,
+   .proc_handler   = clear_huge_page_nocache_handler,
+   .extra1 = &zero,
+   .extra2 = &one,
+   },
+#endif
{ }
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 625ca33..395d574 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -3970,12 +3971,43 @@ EXPORT_SYMBOL(might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
-#ifndef ARCH_HAS_USER_NOCACHE
-#define ARCH_HAS_USER_NOCACHE 0
-#endif
+#ifdef ARCH_HAS_USER_NOCACHE
+int sysctl_clear_huge_page_nocache = 1;
+static DEFINE_MUTEX(sysctl_clear_huge_page_nocache_lock);
+static struct static_key clear_huge_page_nocache __read_mostly =
+   STATIC_KEY_INIT_TRUE;
 
-#if ARCH_HAS_USER_NOCACHE == 0
+static inline int is_nocache_enabled(void)
+{
+   return static_key_true(&clear_huge_page_nocache);
+}
+
+int clear_huge_page_nocache_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *length, loff_t *ppos)
+{
+   int orig_value = sysctl_clear_huge_page_nocache;
+   int ret;
+
+   mutex_lock(&sysctl_clear_huge_page_nocache_lock);
+   orig_value = sysctl_clear_huge_page_nocache;
+   ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+   if (!ret && write && sysctl_clear_huge_page_nocache != orig_value) {
+   if (sysctl_clear_huge_page_nocache)
+   static_key_slow_inc(&clear_huge_page_nocache);
+   else
+   static_key_slow_dec(&clear_huge_page_nocache);
+   }
+   mutex_unlock(&sysctl_clear_huge_page_nocache_lock);
+
+   return ret;
+}
+#else
 #define clear_user_highpage_nocache clear_user_highpage
+
+static inline int is_nocache_enabled(void)
+{
+   return 0;
+}
 #endif
 
 static void clear_gigantic_page(

[PATCH v4 5/8] x86: Add clear_page_nocache

2012-08-20 Thread Kirill A. Shutemov
From: Andi Kleen 

Add a cache-avoiding version of clear_page. It is a straightforward integer
variant of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem, including a layer that
non-cache-coherent architectures that use the virtual address for flushing
can hook into. This is not needed on x86, of course.

If an architecture wants to provide a cache-avoiding version of clear_page,
it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
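
The clear_user_highpage_nocache() glue for x86 is essentially a kmap_atomic()
wrapper around the new primitive; a minimal sketch (not the literal hunk):

void clear_user_highpage_nocache(struct page *page, unsigned long vaddr)
{
	void *p = kmap_atomic(page);	/* map the (possibly highmem) page */

	clear_page_nocache(p);		/* cache-avoiding clear */
	kunmap_atomic(p);
}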

Signed-off-by: Andi Kleen 
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 +++
 arch/x86/include/asm/string_64.h |5 +++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 ++
 arch/x86/lib/clear_page_64.S |   29 +++
 arch/x86/mm/fault.c  |7 
 7 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long 
pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_32_H */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..14e47a2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
@@ -40,7 +41,7 @@ endif
 else
 obj-y += iomap_copy_64.o
 lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-lib-y += thunk_64.o clear_page_64.o copy_page_64.o
+lib-y += thunk_64.o copy_page_64.o
 lib-y += memmove_64.o memset_64.o
 lib-y += copy_user_64.o copy_user_nocache_64.o
lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_page_32.S b/arch/x86/lib/clear_page_32.S
new file mode 100644
index 000..9592161
--- /dev/null
+++ b/arch/x86/lib/clear_page_32.S
@@ -0,0 +1,72 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Fallback version if SSE2 is not available.
+ */
+ENTRY(clear_page_nocache)
+   CFI_STARTPROC
+   mov%eax,%edx
+   xorl   %eax,%eax
+   movl   $4096/32,%ecx
+   .p2align 4
+.Lloop:
+   decl%ecx
+#define PUT(x) mov %eax,x*4(%edx)
+   PUT(0)
+   PUT(1)
+   PUT(2)
+   PUT(3)
+   PUT(4)
+   PUT(5)
+   PUT(6)
+   PUT(7)
+#undef PUT
+   lea 32(%edx),%edx
+   jnz .Lloop
+   nop
+   ret
+   CFI_ENDPROC
+ENDPROC(clear_page_nocache)
+
+   .section .altinstr_replacement,"ax"
+1:  .byte 0xeb /* jmp  */
+   .byte (clear_page_nocache_sse2 - clear_page_nocache) - (2f - 1b)
+   /* offset */
+2:
+   .previous
+   .section .altinstructions,"a"
+   altinstruction_entry clear_page_nocache,1b,X86_FEATURE_XMM2,\
+   16, 2b-1b
+   .previous
+
+/*
+ * Zero a page avoiding the caches
+ * eax page
+ */
+ENTRY(clear_page_nocache_sse2)
+   CFI_STARTPROC
+   mov%eax,%edx
+   

Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-12 Thread Kirill A. Shutemov
Hi,

Any feedback?

-- 
 Kirill A. Shutemov



Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-25 Thread Kirill A. Shutemov
On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
> Without repeatable hard numbers such code just gets into the 
> kernel and bitrots there as new CPU generations come in - a few 
> years down the line the original decisions often degrade to pure 
> noise. We've been there, we've done that, we don't want to 
> repeat it.



Hard numbers are hard.
I've checked some workloads: Mosbench, NPB, SPECjvm2008. Most of the time the
patchset doesn't show any difference (within run-to-run deviation).
On NPB it recovers the THP regression, but that's probably not enough to make
a decision.

It would be nice if somebody could test the patchset on another system or
workload, especially a configuration that shows a regression with
THP enabled.

-- 
 Kirill A. Shutemov



Re: [PATCH] powerpc: thp: Fix crash on mremap

2014-01-01 Thread Kirill A. Shutemov
On Wed, Jan 01, 2014 at 09:29:05PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2014-01-01 at 15:23 +0530, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V" 
> > 
> > This patch fix the below crash
> > 
> > NIP [c004cee4] .__hash_page_thp+0x2a4/0x440
> > LR [c00439ac] .hash_page+0x18c/0x5e0
> > ...
> > Call Trace:
> > [c00736103c40] [1b00] 0x1b00(unreliable)
> > [437908.479693] [c00736103d50] [c00439ac] .hash_page+0x18c/0x5e0
> > [437908.479699] [c00736103e30] [c000924c] 
> > .do_hash_page+0x4c/0x58
> > 
> > On ppc64 we use the pgtable for storing the hpte slot information and
> > store address to the pgtable at a constant offset (PTRS_PER_PMD) from
> > pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> > the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
> > from new pmd.
> > 
> > We also want to move the withdraw and deposit before the set_pmd so
> > that, when page fault find the pmd as trans huge we can be sure that
> > pgtable can be located at the offset.
> > 
> > Signed-off-by: Aneesh Kumar K.V 
> > ---
> > NOTE:
> > For other archs we would just be removing the pgtable from the list and 
> > adding it back.
> > I didn't find an easy way to make it not do that without lots of #ifdef 
> > around. Any
> > suggestion around that is welcome.
> 
> What about
> 
> - if (new_ptl != old_ptl) {
> +   if (new_ptl != old_ptl || ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW) {
> 
> Or something similar ?

Looks sane to me. Or something with IS_ENABLED(), if needed.

> 
> Cheers,
> Ben.
> 
> >  mm/huge_memory.c | 21 ++---
> >  1 file changed, 10 insertions(+), 11 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7de1bf85f683..eb2e60d9ba45 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1500,24 +1500,23 @@ int move_huge_pmd(struct vm_area_struct *vma, 
> > struct vm_area_struct *new_vma,
> >  */
> > ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
> > if (ret == 1) {
> > +   pgtable_t pgtable;
> > +
> > new_ptl = pmd_lockptr(mm, new_pmd);
> > if (new_ptl != old_ptl)
> > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> > VM_BUG_ON(!pmd_none(*new_pmd));
> > +   /*
> > +* Archs like ppc64 use pgtable to store per pmd
> > +* specific information. So when we switch the pmd,
> > +* we should also withdraw and deposit the pgtable
> > +*/
> > +   pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> > +   pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> > set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
> > -   if (new_ptl != old_ptl) {
> > -   pgtable_t pgtable;
> > -
> > -   /*
> > -* Move preallocated PTE page table if new_pmd is on
> > -* different PMD page table.
> > -*/

Please don't lose the comment.

> > -   pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> > -       pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> > -
> > +   if (new_ptl != old_ptl)
> > spin_unlock(new_ptl);
> > -   }
> > spin_unlock(old_ptl);
> > }
> >  out:
> 
> 

-- 
 Kirill A. Shutemov


RE: [PATCH -V2] powerpc: thp: Fix crash on mremap

2014-01-02 Thread Kirill A. Shutemov
Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> This patch fix the below crash
> 
> NIP [c004cee4] .__hash_page_thp+0x2a4/0x440
> LR [c00439ac] .hash_page+0x18c/0x5e0
> ...
> Call Trace:
> [c00736103c40] [1b00] 0x1b00(unreliable)
> [437908.479693] [c00736103d50] [c00439ac] .hash_page+0x18c/0x5e0
> [437908.479699] [c00736103e30] [c000924c] .do_hash_page+0x4c/0x58
> 
> On ppc64 we use the pgtable for storing the hpte slot information and
> store address to the pgtable at a constant offset (PTRS_PER_PMD) from
> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
> from new pmd.
> 
> We also want to move the withdraw and deposit before the set_pmd so
> that, when page fault find the pmd as trans huge we can be sure that
> pgtable can be located at the offset.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
> Changes from V1:
> * limit the withraw/deposit to only ppc64
> 
>  arch/Kconfig   |  3 +++
>  arch/powerpc/platforms/Kconfig.cputype |  1 +
>  include/linux/huge_mm.h|  6 ++
>  mm/huge_memory.c   | 21 -
>  4 files changed, 22 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index f1cf895c040f..3759e70a649d 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -371,6 +371,9 @@ config HAVE_IRQ_TIME_ACCOUNTING
>  config HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   bool
>  
> +config ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW

I don't like the name of the option, but I can't find a better one... :(

> + bool
> +
>  config HAVE_ARCH_SOFT_DIRTY
>   bool
>  
> diff --git a/arch/powerpc/platforms/Kconfig.cputype 
> b/arch/powerpc/platforms/Kconfig.cputype
> index bca2465a9c34..5f83b4334e5f 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>   select PPC_FPU
>   select PPC_HAVE_PMU_SUPPORT
>   select SYS_SUPPORTS_HUGETLBFS
> + select ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 91672e2deec3..836242a738a5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -230,4 +230,10 @@ static inline int do_huge_pmd_numa_page(struct mm_struct 
> *mm, struct vm_area_str
>  
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +#define ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW 1
> +#else
> +#define ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW 0
> +#endif
> +

Just use the config option directly:

if (new_ptl != old_ptl ||
IS_ENABLED(CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW))
...


Otherwise, looks good:

Acked-by: Kirill A. Shutemov 

>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7de1bf85f683..32006b51d102 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1505,19 +1505,22 @@ int move_huge_pmd(struct vm_area_struct *vma, struct 
> vm_area_struct *new_vma,
>   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>   pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
>   VM_BUG_ON(!pmd_none(*new_pmd));
> - set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
> - if (new_ptl != old_ptl) {
> + /*
> +  * Archs like ppc64 use pgtable to store per pmd
> +  * specific information. So when we switch the pmd,
> +  * we should also withdraw and deposit the pgtable
> +  *
> +  * With split pmd lock we also need to move preallocated
> +  * PTE page table if new_pmd is on different PMD page table.
> +  */
> + if (new_ptl != old_ptl || ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW) {
>   pgtable_t pgtable;
> -
> - /*
> -  * Move preallocated PTE page table if new_pmd is on
> -  * different PMD page table.
> -  */
>   pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>   pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> -
> - spin_unlock(new_ptl);
>   }
> + set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
>   spin_unlock(old_ptl);
>   }
>  out:
> -- 
> 1.8.3.2

-- 
 Kirill A. Shutemov


Re: [PATCH V4] powerpc: thp: Fix crash on mremap

2014-01-13 Thread Kirill A. Shutemov
On Mon, Jan 13, 2014 at 11:34:24AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> This patch fix the below crash
> 
> NIP [c004cee4] .__hash_page_thp+0x2a4/0x440
> LR [c00439ac] .hash_page+0x18c/0x5e0
> ...
> Call Trace:
> [c00736103c40] [1b00] 0x1b00(unreliable)
> [437908.479693] [c00736103d50] [c00439ac] .hash_page+0x18c/0x5e0
> [437908.479699] [c00736103e30] [c000924c] .do_hash_page+0x4c/0x58
> 
> On ppc64 we use the pgtable for storing the hpte slot information and
> store address to the pgtable at a constant offset (PTRS_PER_PMD) from
> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
> from new pmd.
> 
> We also want to move the withdraw and deposit before the set_pmd so
> that, when page fault find the pmd as trans huge we can be sure that
> pgtable can be located at the offset.
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Kirill A. Shutemov 


-- 
 Kirill A. Shutemov


Re: [PATCH] powerpc: thp: Fix crash on mremap

2014-01-13 Thread Kirill A. Shutemov
On Mon, Jan 13, 2014 at 02:17:48PM -0800, Andrew Morton wrote:
> On Thu, 2 Jan 2014 04:19:51 +0200 "Kirill A. Shutemov"  
> wrote:
> 
> > On Wed, Jan 01, 2014 at 09:29:05PM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2014-01-01 at 15:23 +0530, Aneesh Kumar K.V wrote:
> > > > From: "Aneesh Kumar K.V" 
> > > > 
> > > > This patch fix the below crash
> > > > 
> > > > NIP [c004cee4] .__hash_page_thp+0x2a4/0x440
> > > > LR [c00439ac] .hash_page+0x18c/0x5e0
> > > > ...
> > > > Call Trace:
> > > > [c00736103c40] [1b00] 0x1b00(unreliable)
> > > > [437908.479693] [c00736103d50] [c00439ac] 
> > > > .hash_page+0x18c/0x5e0
> > > > [437908.479699] [c00736103e30] [c000924c] 
> > > > .do_hash_page+0x4c/0x58
> > > > 
> > > > On ppc64 we use the pgtable for storing the hpte slot information and
> > > > store address to the pgtable at a constant offset (PTRS_PER_PMD) from
> > > > pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> > > > the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
> > > > from new pmd.
> > > > 
> > > > We also want to move the withdraw and deposit before the set_pmd so
> > > > that, when page fault find the pmd as trans huge we can be sure that
> > > > pgtable can be located at the offset.
> > > > 
> 
> Did this get fixed?

New version: http://thread.gmane.org/gmane.linux.kernel.mm/111809

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm: move FAULT_AROUND_ORDER to arch/

2014-03-25 Thread Kirill A. Shutemov
On Tue, Mar 25, 2014 at 12:20:15PM +0530, Madhavan Srinivasan wrote:
> Kirill A. Shutemov with the commit 96bacfe542 introduced
> vm_ops->map_pages() for mapping easy accessible pages around
> fault address in hope to reduce number of minor page faults.
> Based on his workload runs, suggested FAULT_AROUND_ORDER
> (knob to control the numbers of pages to map) is 4.
> 
> This patch moves the FAULT_AROUND_ORDER macro to arch/ for
> architecture maintainers to decide on suitable FAULT_AROUND_ORDER
> value based on performance data for that architecture.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/pgtable.h |6 ++
>  arch/x86/include/asm/pgtable.h |5 +
>  include/asm-generic/pgtable.h  |   10 ++
>  mm/memory.c|2 --
>  4 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 3ebb188..9fcbd48 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -19,6 +19,12 @@ struct mm_struct;
>  #endif
>  
>  /*
> + * With a few real world workloads that were run,
> + * the performance data showed that a value of 3 is more advantageous.
> + */
> +#define FAULT_AROUND_ORDER   3
> +
> +/*
>   * We save the slot number & secondary bit in the second half of the
>   * PTE page. We use the 8 bytes per each pte entry.
>   */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 938ef1d..8387a65 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -7,6 +7,11 @@
>  #include 
>  
>  /*
> + * Based on Kirill's test results, fault around order is set to 4
> + */
> +#define FAULT_AROUND_ORDER 4
> +
> +/*
>   * Macro to mark a page protection value as UC-
>   */
>  #define pgprot_noncached(prot)   \
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1ec08c1..62f7f07 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -7,6 +7,16 @@
>  #include 
>  #include 
>  
> +
> +/*
> + * Fault around order is a control knob to decide the fault around pages.
> + * Default value is set to 0UL (disabled), but the arch can override it as
> + * desired.
> + */
> +#ifndef FAULT_AROUND_ORDER
> +#define FAULT_AROUND_ORDER   0UL
> +#endif

The FAULT_AROUND_ORDER == 0 case should be handled separately in
do_read_fault(): there is no reason to go to do_fault_around() if we are going
to fault in only one page.

-- 
 Kirill A. Shutemov

Re: [PATCH V2 1/2] mm: move FAULT_AROUND_ORDER to arch/

2014-04-04 Thread Kirill A. Shutemov
On Fri, Apr 04, 2014 at 11:57:14AM +0530, Madhavan Srinivasan wrote:
> Kirill A. Shutemov with faultaround patchset introduced
> vm_ops->map_pages() for mapping easy accessible pages around
> fault address in hope to reduce number of minor page faults.
> 
> This patch creates infrastructure to move the FAULT_AROUND_ORDER
> to arch/ using Kconfig. This will enable architecture maintainers
> to decide on suitable FAULT_AROUND_ORDER value based on
> performance data for that architecture. Patch also adds
> FAULT_AROUND_ORDER Kconfig element in arch/X86.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/x86/Kconfig   |4 
>  include/linux/mm.h |9 +
>  mm/memory.c|   12 +---
>  3 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9c0a657..5833f22 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1177,6 +1177,10 @@ config DIRECT_GBPAGES
> support it. This can improve the kernel's performance a tiny bit by
> reducing TLB pressure. If in doubt, say "Y".
>  
> +config FAULT_AROUND_ORDER
> + int
> + default "4"
> +
>  # Common NUMA Features
>  config NUMA
>   bool "Numa Memory Allocation and Scheduler Support"
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0bd4359..b93c1c3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -26,6 +26,15 @@ struct file_ra_state;
>  struct user_struct;
>  struct writeback_control;
>  
> +/*
> + * Fault around order is a control knob to decide the fault around pages.
> + * Default value is set to 0UL (disabled), but the arch can override it as
> + * desired.
> + */
> +#ifndef CONFIG_FAULT_AROUND_ORDER
> +#define CONFIG_FAULT_AROUND_ORDER 0
> +#endif
> +

I don't think it should be in a header file: nobody except mm/memory.c cares.
Just put it there instead, in place of the '#define FAULT_AROUND_ORDER'.

>  #ifndef CONFIG_NEED_MULTIPLE_NODES   /* Don't use mapnrs, do it properly */
>  extern unsigned long max_mapnr;
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index b02c584..22a4a89 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3358,10 +3358,8 @@ void do_set_pte(struct vm_area_struct *vma, unsigned 
> long address,
>   update_mmu_cache(vma, address, pte);
>  }
>  
> -#define FAULT_AROUND_ORDER 4
> -
>  #ifdef CONFIG_DEBUG_FS
> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> +static unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>  
>  static int fault_around_order_get(void *data, u64 *val)
>  {
> @@ -3371,7 +3369,7 @@ static int fault_around_order_get(void *data, u64 *val)
>  
>  static int fault_around_order_set(void *data, u64 val)
>  {
> - BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
> + BUILD_BUG_ON((1UL << CONFIG_FAULT_AROUND_ORDER) > PTRS_PER_PTE);
>   if (1UL << val > PTRS_PER_PTE)
>   return -EINVAL;
>   fault_around_order = val;
> @@ -3406,14 +3404,14 @@ static inline unsigned long fault_around_pages(void)
>  {
>   unsigned long nr_pages;
>  
> - nr_pages = 1UL << FAULT_AROUND_ORDER;
> + nr_pages = 1UL << CONFIG_FAULT_AROUND_ORDER;
>   BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
>   return nr_pages;
>  }
>  
>  static inline unsigned long fault_around_mask(void)
>  {
> - return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
> + return ~((1UL << (PAGE_SHIFT + CONFIG_FAULT_AROUND_ORDER)) - 1);
>  }
>  #endif
>  
> @@ -3471,7 +3469,7 @@ static int do_read_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>* if page by the offset is not ready to be mapped (cold cache or
>* something).
>*/
> - if (vma->vm_ops->map_pages) {
> + if ((vma->vm_ops->map_pages) && (fault_around_pages() > 1)) {

if (vma->vm_ops->map_pages && fault_around_pages()) {

>           pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>   do_fault_around(vma, address, pte, pgoff, flags);
>   if (!pte_same(*pte, orig_pte))
> -- 
> 1.7.10.4
> 

-- 
 Kirill A. Shutemov

RE: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

2014-04-28 Thread Kirill A. Shutemov
Madhavan Srinivasan wrote:
> Kirill A. Shutemov with 8c6e50b029 commit introduced
> vm_ops->map_pages() for mapping easy accessible pages around
> fault address in hope to reduce number of minor page faults.
> 
> This patch creates infrastructure to modify the FAULT_AROUND_ORDER
> value using mm/Kconfig. This will enable architecture maintainers
> to decide on suitable FAULT_AROUND_ORDER value based on
> performance data for that architecture. Patch also defaults
> FAULT_AROUND_ORDER Kconfig element to 4.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  mm/Kconfig  |8 
>  mm/memory.c |   11 ---
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebe5880..c7fc4f1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -176,6 +176,14 @@ config MOVABLE_NODE
>  config HAVE_BOOTMEM_INFO_NODE
>   def_bool n
>  
> +#
> +# Fault around order is a control knob to decide the fault around pages.
> +# Default value is set to 4 , but the arch can override it as desired.
> +#
> +config FAULT_AROUND_ORDER
> + int
> + default 4
> +
>  # eventually, we can have this option just 'select SPARSEMEM'
>  config MEMORY_HOTPLUG
>   bool "Allow for memory hot-add"
> diff --git a/mm/memory.c b/mm/memory.c
> index d0f0bef..457436d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3382,11 +3382,9 @@ void do_set_pte(struct vm_area_struct *vma, unsigned 
> long address,
>   update_mmu_cache(vma, address, pte);
>  }
>  
> -#define FAULT_AROUND_ORDER 4
> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>  
>  #ifdef CONFIG_DEBUG_FS
> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> -
>  static int fault_around_order_get(void *data, u64 *val)
>  {
>   *val = fault_around_order;
> @@ -3395,7 +3393,6 @@ static int fault_around_order_get(void *data, u64 *val)
>  
>  static int fault_around_order_set(void *data, u64 val)
>  {
> - BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
>   if (1UL << val > PTRS_PER_PTE)
>   return -EINVAL;
>   fault_around_order = val;
> @@ -3430,14 +3427,14 @@ static inline unsigned long fault_around_pages(void)
>  {
>   unsigned long nr_pages;
>  
> - nr_pages = 1UL << FAULT_AROUND_ORDER;
> + nr_pages = 1UL << fault_around_order;
>   BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);

This BUILD_BUG_ON() doesn't make sense any more since the compiler usually can't
prove anything about an extern variable.

Drop it or change to VM_BUG_ON() or something.

>   return nr_pages;
>  }
>  
>  static inline unsigned long fault_around_mask(void)
>  {
> - return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
> + return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);

fault_around_pages() and fault_around_mask() should be moved outside of
#ifdef CONFIG_DEBUG_FS and consolidated, since they are practically identical
in both branches after the changes.

>  }
>  #endif
>  
> @@ -3495,7 +3492,7 @@ static int do_read_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>* if page by the offset is not ready to be mapped (cold cache or
>* something).
>    */
> - if (vma->vm_ops->map_pages) {
> + if ((vma->vm_ops->map_pages) && (fault_around_order)) {

Way too many parentheses.

>   pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>   do_fault_around(vma, address, pte, pgoff, flags);
>   if (!pte_same(*pte, orig_pte))
-- 
 Kirill A. Shutemov

Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

2014-05-19 Thread Kirill A. Shutemov
Andrew Morton wrote:
> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins  
> wrote:
> 
> > Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> > the order of the fault-around size in bytes, and fault_around_pages()
> > use 1UL << (fault_around_order - PAGE_SHIFT)
> 
> Yes.  And shame on me for missing it (this time!) at review.
> 
> There's still time to fix this.  Patches, please.

Here it is. Made at 3.30 AM, build tested only.

I'll sign it off tomorrow after testing.

diff --git a/mm/memory.c b/mm/memory.c
index 037b812a9531..9d6941c9a9e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3402,62 +3402,62 @@ void do_set_pte(struct vm_area_struct *vma, unsigned 
long address,
update_mmu_cache(vma, address, pte);
 }
 
-#define FAULT_AROUND_ORDER 4
+#define FAULT_AROUND_BYTES 65536
 
 #ifdef CONFIG_DEBUG_FS
-static unsigned int fault_around_order = FAULT_AROUND_ORDER;
+static unsigned int fault_around_bytes = FAULT_AROUND_BYTES;
 
-static int fault_around_order_get(void *data, u64 *val)
+static int fault_around_bytes_get(void *data, u64 *val)
 {
-   *val = fault_around_order;
+   *val = fault_around_bytes;
return 0;
 }
 
-static int fault_around_order_set(void *data, u64 val)
+static int fault_around_bytes_set(void *data, u64 val)
 {
-   BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
-   if (1UL << val > PTRS_PER_PTE)
+   BUILD_BUG_ON(FAULT_AROUND_BYTES / PAGE_SIZE > PTRS_PER_PTE);
+   if (val / PAGE_SIZE > PTRS_PER_PTE)
return -EINVAL;
-   fault_around_order = val;
+   fault_around_bytes = val;
return 0;
 }
-DEFINE_SIMPLE_ATTRIBUTE(fault_around_order_fops,
-   fault_around_order_get, fault_around_order_set, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
+   fault_around_bytes_get, fault_around_bytes_set, "%llu\n");
 
 static int __init fault_around_debugfs(void)
 {
void *ret;
 
-   ret = debugfs_create_file("fault_around_order", 0644, NULL, NULL,
-   &fault_around_order_fops);
+   ret = debugfs_create_file("fault_around_bytes", 0644, NULL, NULL,
+   &fault_around_bytes_fops);
if (!ret)
-   pr_warn("Failed to create fault_around_order in debugfs");
+   pr_warn("Failed to create fault_around_bytes in debugfs");
return 0;
 }
 late_initcall(fault_around_debugfs);
 
 static inline unsigned long fault_around_pages(void)
 {
-   return 1UL << fault_around_order;
+   return fault_around_bytes / PAGE_SIZE;
 }
 
 static inline unsigned long fault_around_mask(void)
 {
-   return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
+   return ~(round_down(fault_around_bytes, PAGE_SIZE) - 1);
 }
 #else
 static inline unsigned long fault_around_pages(void)
 {
unsigned long nr_pages;
 
-   nr_pages = 1UL << FAULT_AROUND_ORDER;
+   nr_pages = FAULT_AROUND_BYTES / PAGE_SIZE;
BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
return nr_pages;
 }
 
 static inline unsigned long fault_around_mask(void)
 {
-   return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
+   return ~(round_down(FAULT_AROUND_BYTES, PAGE_SIZE) - 1);
 }
 #endif
 
@@ -3515,7 +3515,7 @@ static int do_read_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * if page by the offset is not ready to be mapped (cold cache or
 * something).
 */
-   if (vma->vm_ops->map_pages) {
+   if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
    pte = pte_offset_map_lock(mm, pmd, address, &ptl);
do_fault_around(vma, address, pte, pgoff, flags);
if (!pte_same(*pte, orig_pte))
-- 
 Kirill A. Shutemov

Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

2014-05-20 Thread Kirill A. Shutemov
Rusty Russell wrote:
> "Kirill A. Shutemov"  writes:
> > Andrew Morton wrote:
> >> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins  
> >> wrote:
> >> 
> >> > Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> >> > the order of the fault-around size in bytes, and fault_around_pages()
> >> > use 1UL << (fault_around_order - PAGE_SHIFT)
> >> 
> >> Yes.  And shame on me for missing it (this time!) at review.
> >> 
> >> There's still time to fix this.  Patches, please.
> >
> > Here it is. Made at 3.30 AM, build tested only.
> 
> Prefer on top of Maddy's patch which makes it always a variable, rather
> than CONFIG_DEBUG_FS.  It's got enough hair as it is.

Something like this?

From: "Kirill A. Shutemov" 
Date: Tue, 20 May 2014 13:02:03 +0300
Subject: [PATCH] mm: nominate faultaround area in bytes rather then page order

There is evidence that the faultaround feature is less relevant on
architectures with a page size bigger than 4k, which makes sense since
the page fault overhead per byte of mapped area should be lower there.

Let's rework the feature to specify faultaround area in bytes instead of
page order. It's 64 kilobytes for now.

The patch effectively disables faultaround on architectures with
page size >= 64k (like ppc64).

It's possible that some other size of faultaround area is relevant for a
platform. We can expose the `fault_around_bytes' variable to arch-specific
code once such platforms are found.

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 62 +++--
 1 file changed, 23 insertions(+), 39 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 037b812a9531..252b319e8cdf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, unsigned 
long address,
update_mmu_cache(vma, address, pte);
 }
 
-#define FAULT_AROUND_ORDER 4
+static unsigned long fault_around_bytes = 65536;
+
+static inline unsigned long fault_around_pages(void)
+{
+   return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+}
+
+static inline unsigned long fault_around_mask(void)
+{
+   return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+}
 
-#ifdef CONFIG_DEBUG_FS
-static unsigned int fault_around_order = FAULT_AROUND_ORDER;
 
-static int fault_around_order_get(void *data, u64 *val)
+#ifdef CONFIG_DEBUG_FS
+static int fault_around_bytes_get(void *data, u64 *val)
 {
-   *val = fault_around_order;
+   *val = fault_around_bytes;
return 0;
 }
 
-static int fault_around_order_set(void *data, u64 val)
+static int fault_around_bytes_set(void *data, u64 val)
 {
-   BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
-   if (1UL << val > PTRS_PER_PTE)
+   if (val / PAGE_SIZE > PTRS_PER_PTE)
return -EINVAL;
-   fault_around_order = val;
+   fault_around_bytes = val;
return 0;
 }
-DEFINE_SIMPLE_ATTRIBUTE(fault_around_order_fops,
-   fault_around_order_get, fault_around_order_set, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
+   fault_around_bytes_get, fault_around_bytes_set, "%llu\n");
 
 static int __init fault_around_debugfs(void)
 {
void *ret;
 
-   ret = debugfs_create_file("fault_around_order", 0644, NULL, NULL,
-   &fault_around_order_fops);
+   ret = debugfs_create_file("fault_around_bytes", 0644, NULL, NULL,
+   &fault_around_bytes_fops);
if (!ret)
-   pr_warn("Failed to create fault_around_order in debugfs");
+   pr_warn("Failed to create fault_around_bytes in debugfs");
return 0;
 }
 late_initcall(fault_around_debugfs);
-
-static inline unsigned long fault_around_pages(void)
-{
-   return 1UL << fault_around_order;
-}
-
-static inline unsigned long fault_around_mask(void)
-{
-   return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
-}
-#else
-static inline unsigned long fault_around_pages(void)
-{
-   unsigned long nr_pages;
-
-   nr_pages = 1UL << FAULT_AROUND_ORDER;
-   BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
-   return nr_pages;
-}
-
-static inline unsigned long fault_around_mask(void)
-{
-   return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
-}
 #endif
 
 static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
@@ -3515,7 +3499,7 @@ static int do_read_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * if page by the offset is not ready to be mapped (cold cache or
     * something).
 */
-   if (vma->vm_ops->map_pages) {
+   if (vma->vm_ops-&

Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

2014-05-21 Thread Kirill A. Shutemov
Andrew Morton wrote:
> On Tue, 20 May 2014 13:27:38 +0300 (EEST) "Kirill A. Shutemov" 
>  wrote:
> 
> > Rusty Russell wrote:
> > > "Kirill A. Shutemov"  writes:
> > > > Andrew Morton wrote:
> > > >> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins 
> > > >>  wrote:
> > > >> 
> > > >> > Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> > > >> > the order of the fault-around size in bytes, and fault_around_pages()
> > > >> > use 1UL << (fault_around_order - PAGE_SHIFT)
> > > >> 
> > > >> Yes.  And shame on me for missing it (this time!) at review.
> > > >> 
> > > >> There's still time to fix this.  Patches, please.
> > > >
> > > > Here it is. Made at 3.30 AM, build tested only.
> > > 
> > > Prefer on top of Maddy's patch which makes it always a variable, rather
> > > than CONFIG_DEBUG_FS.  It's got enough hair as it is.
> > 
> > Something like this?
> 
> This appears to be against mainline, not against Madhavan's patch.  As
> mentioned previously, I'd prefer it that way but confused.
> 
> 
> > From: "Kirill A. Shutemov" 
> > Date: Tue, 20 May 2014 13:02:03 +0300
> > Subject: [PATCH] mm: nominate faultaround area in bytes rather then page 
> > order
> > 
> > There are evidences that faultaround feature is less relevant on
> > architectures with page size bigger then 4k. Which makes sense since
> > page fault overhead per byte of mapped area should be less there.
> > 
> > Let's rework the feature to specify faultaround area in bytes instead of
> > page order. It's 64 kilobytes for now.
> > 
> > The patch effectively disables faultaround on architectures with
> > page size >= 64k (like ppc64).
> > 
> > It's possible that some other size of faultaround area is relevant for a
> > platform. We can expose `fault_around_bytes' variable to arch-specific
> > code once such platforms will be found.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  mm/memory.c | 62 
> > +++--
> >  1 file changed, 23 insertions(+), 39 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 037b812a9531..252b319e8cdf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, 
> > unsigned long address,
> > update_mmu_cache(vma, address, pte);
> >  }
> >  
> > -#define FAULT_AROUND_ORDER 4
> > +static unsigned long fault_around_bytes = 65536;
> > +
> > +static inline unsigned long fault_around_pages(void)
> > +{
> > +   return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
> > +}
> 
> I think we should round up, not down.  So if the user asks for 1kb,
> they get one page.
> 
> So this becomes
> 
>   return PAGE_ALIGN(fault_around_bytes) / PAGE_SIZE;

See below.

> > +static inline unsigned long fault_around_mask(void)
> > +{
> > +   return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
> > +}
> 
> And this has me a bit stumped.  It's not helpful that do_fault_around()
> is undocumented.  Does it fault in N/2 pages ahead and N/2 pages
> behind?  Or does it align the address down to the highest multiple of
> fault_around_bytes?  It appears to be the latter, so the location of
> the faultaround window around the fault address is basically random,
> depending on what address userspace happened to pick.  I don't know why
> we did this :(

When we call ->map_pages() we need to make sure that we stay within the VMA
and the page table. We don't want to cross a page table boundary, because
the page table is what the ptlock covers in the split ptlock case.

I designed the feature with the fault area nominated in page order in mind,
and I found it's easier to make sure we don't cross boundaries if we
align the virtual address of the fault-around area to PAGE_SIZE <<
FAULT_AROUND_ORDER.

And yes, the fault address may be anywhere within the area. You can think of
this as a virtual page of size PAGE_SIZE << FAULT_AROUND_ORDER: no matter
where the fault address is, we handle the naturally aligned area of that size
which the fault address belongs to.
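
To make the alignment concrete, here is the same arithmetic as a standalone
sketch (4K pages, FAULT_AROUND_ORDER = 4, i.e. a 64K area; the fault address
is made up):

#include <stdio.h>

#define PAGE_SHIFT		12
#define FAULT_AROUND_ORDER	4	/* 16 pages == 64K area */

/* mirrors the order-based fault_around_mask() helper */
static unsigned long fault_around_mask(void)
{
	return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
}

int main(void)
{
	unsigned long fault_address = 0x12345678UL;
	unsigned long start = fault_address & fault_around_mask();

	/* prints 0x12340000..0x12350000: all 16 PTEs sit in one page
	 * table, and the faulting page can be anywhere inside */
	printf("area: %#lx..%#lx\n", start,
	       start + (1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)));
	return 0;
}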

I've used rounddown_pow_of_two() in the patch to align to the nearest page
order, not to page size, because that's what the current do_fault_around()
expects to see. And roundup is not an option: nobody expects fault around

Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

2014-05-23 Thread Kirill A. Shutemov
Andrew Morton wrote:
> On Wed, 21 May 2014 16:40:27 +0300 (EEST) "Kirill A. Shutemov" 
>  wrote:
> 
> > > Or something.  Can we please get some code commentary over
> > > do_fault_around() describing this design decision and explaining the
> > > reasoning behind it?
> > 
> > I'll do this. But if do_fault_around() rework is needed, I want to do that
> > first.
> 
> This sort of thing should be at least partially driven by observation
> and I don't have the data for that.  My seat of the pants feel is that
> after the first fault, accesses at higher addresses are more
> common/probable than accesses at lower addresses.

It's probably true for data, but the feature is mostly targeted at code pages,
and with all the jumps the situation is not that obvious to me.

> But we don't need to do all that right now.  Let's get the current
> implementation wrapped up for 3.15: get the interface finalized (bytes,
> not pages!)

The patch earlier in the thread is okay for that, right?

> and get the current design decisions appropriately documented.

Here it is. Based on the patch that converts order->bytes.

From: "Kirill A. Shutemov" 
Date: Fri, 23 May 2014 15:16:47 +0300
Subject: [PATCH] mm: document do_fault_around() feature

Some clarification on how faultaround works.

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 252b319e8cdf..8d723b8d3c86 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3404,6 +3404,10 @@ void do_set_pte(struct vm_area_struct *vma, unsigned 
long address,
 
 static unsigned long fault_around_bytes = 65536;
 
+/*
+ * fault_around_pages() and fault_around_mask() round down fault_around_bytes
+ * to the nearest page order. It's what do_fault_around() expects to see.
+ */
 static inline unsigned long fault_around_pages(void)
 {
return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
@@ -3445,6 +3449,29 @@ static int __init fault_around_debugfs(void)
 late_initcall(fault_around_debugfs);
 #endif
 
+/*
+ * do_fault_around() tries to map a few pages around the fault address. The hope
+ * is that the pages will be needed soon and this will lower the number of
+ * faults to handle.
+ *
+ * It uses vm_ops->map_pages() to map the pages, which skips a page if it's
+ * not ready to be mapped: not up-to-date, locked, etc.
+ *
+ * This function is called with the page table lock taken. In the split ptlock
+ * case the page table lock only protects those entries which belong to the
+ * page table corresponding to the fault address.
+ *
+ * This function doesn't cross VMA boundaries, in order to call map_pages()
+ * only once.
+ *
+ * fault_around_pages() defines how many pages we'll try to map.
+ * do_fault_around() expects it to be a power of two and less than or equal to
+ * PTRS_PER_PTE.
+ *
+ * The virtual address of the area that we map is naturally aligned to
+ * fault_around_pages() (and therefore to the page order). This way it's easier
+ * to guarantee that we don't cross page table boundaries.
+ */
 static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pgoff_t pgoff, unsigned int flags)
 {
-- 
 Kirill A. Shutemov

Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

2014-05-27 Thread Kirill A. Shutemov
Madhavan Srinivasan wrote:
> On Tuesday 20 May 2014 03:57 PM, Kirill A. Shutemov wrote:
> > Rusty Russell wrote:
> >> "Kirill A. Shutemov"  writes:
> >>> Andrew Morton wrote:
> >>>> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins  
> >>>> wrote:
> >>>>
> >>>>> Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> >>>>> the order of the fault-around size in bytes, and fault_around_pages()
> >>>>> use 1UL << (fault_around_order - PAGE_SHIFT)
> >>>>
> >>>> Yes.  And shame on me for missing it (this time!) at review.
> >>>>
> >>>> There's still time to fix this.  Patches, please.
> >>>
> >>> Here it is. Made at 3.30 AM, build tested only.
> >>
> >> Prefer on top of Maddy's patch which makes it always a variable, rather
> >> than CONFIG_DEBUG_FS.  It's got enough hair as it is.
> > 
> > Something like this?
> > 
> > From: "Kirill A. Shutemov" 
> > Date: Tue, 20 May 2014 13:02:03 +0300
> > Subject: [PATCH] mm: nominate faultaround area in bytes rather then page 
> > order
> > 
> > There are evidences that faultaround feature is less relevant on
> > architectures with page size bigger then 4k. Which makes sense since
> > page fault overhead per byte of mapped area should be less there.
> > 
> > Let's rework the feature to specify faultaround area in bytes instead of
> > page order. It's 64 kilobytes for now.
> > 
> > The patch effectively disables faultaround on architectures with
> > page size >= 64k (like ppc64).
> > 
> > It's possible that some other size of faultaround area is relevant for a
> > platform. We can expose `fault_around_bytes' variable to arch-specific
> > code once such platforms will be found.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  mm/memory.c | 62 
> > +++--
> >  1 file changed, 23 insertions(+), 39 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 037b812a9531..252b319e8cdf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, 
> > unsigned long address,
> > update_mmu_cache(vma, address, pte);
> >  }
> > 
> > -#define FAULT_AROUND_ORDER 4
> > +static unsigned long fault_around_bytes = 65536;
> > +
> > +static inline unsigned long fault_around_pages(void)
> > +{
> > +   return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
> > +}
> > +
> > +static inline unsigned long fault_around_mask(void)
> > +{
> > +   return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
> > +}
> > 
> > -#ifdef CONFIG_DEBUG_FS
> > -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> > 
> > -static int fault_around_order_get(void *data, u64 *val)
> > +#ifdef CONFIG_DEBUG_FS
> > +static int fault_around_bytes_get(void *data, u64 *val)
> >  {
> > -   *val = fault_around_order;
> > +   *val = fault_around_bytes;
> > return 0;
> >  }
> > 
> > -static int fault_around_order_set(void *data, u64 val)
> > +static int fault_around_bytes_set(void *data, u64 val)
> >  {
> 
> Kindly ignore the question if it is not relevant. Even though we need root
> access to alter the value, will we be fine with a
> negative value?

val is u64, or am I missing something?
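
For what it's worth, a sketch of a stricter setter -- hypothetical, not part
of the patch above -- in case bounding the value is ever wanted:

static int fault_around_bytes_set(void *data, u64 val)
{
	/* val arrives as u64 from debugfs, so it can never be negative */
	if (val / PAGE_SIZE > PTRS_PER_PTE)
		return -EINVAL;
	if (val < PAGE_SIZE)		/* fault around at least one page */
		val = PAGE_SIZE;
	fault_around_bytes = val;
	return 0;
}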

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-17 Thread Kirill A. Shutemov
On Tue, Feb 16, 2016 at 05:24:44PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 23:35:26 +0200
> "Kirill A. Shutemov"  wrote:
> 
> > Is there any chance that I'll be able to trigger the bug using QEMU?
> > Does anybody have an QEMU image I can use?
> > 
> 
> I have no image, but trying to reproduce this under virtualization may
> help to trigger this also on other architectures. After ruling out IPI
> vs. fast_gup I do not really see why this should be arch-specific, and
> it wouldn't be the first time that we hit subtle races first on s390, due
> to our virtualized environment (my test case is make -j20 with 10 CPUs and
> 4GB of memory, no swap).

Could you post your kernel config?

It would be nice also to check if disabling split_huge_page() would make
any difference:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a75081ca31cf..26d2b7b21021 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3364,6 +3364,8 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
bool mlocked;
unsigned long flags;
 
+   return -EBUSY;
+
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-17 Thread Kirill A. Shutemov
On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
> Sebastian Ott  wrote:
> 
> > [   59.875935] [ cut here ]
> > [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> > [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP 
> > DEBUG_PAGEALLOC
> > [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan 
> > ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core ib_addr 
> > ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic sha512_s390 
> > sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card scm_block crc_itu_t 
> > vhost_net tun vhost dm_mod macvtap eadm_sch macvlan kvm autofs4
> > [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> > 4.4.0-07794-ga4eff16-dirty #77
> > [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> > cfecc000
> > [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> > (__split_huge_pmd_locked+0x562/0xa10)
> > [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > PM:0 EA:3
> >Krnl GPRS: 01a7a1cf 03d10177c000 
> > 00044068 5df00215
> > [   59.876051]0001 0001 
> >  774e6900
> > [   59.876054]03ff5200 6d403b10 
> > 6e1eb800 03ff51f0
> > [   59.876058]03d10177c000 00715190 
> > 002bf234 cfecfb58
> > [   59.876068] Krnl Code: 002bf39c: d507d010a000clc 
> > 16(8,%%r13),0(%%r10)
> >   002bf3a2: a7840004brc 
> > 8,2bf3aa
> >  #002bf3a6: a7f40001brc 
> > 15,2bf3a8
> >  >002bf3aa: 91407440tm  
> > 1088(%%r7),64
> >   002bf3ae: a7840208brc 
> > 8,2bf7be
> >   002bf3b2: a7f401e9brc 
> > 15,2bf784
> >   002bf3b6: 9104a006tm  
> > 6(%%r10),4
> >   002bf3ba: a7740004brc 
> > 7,2bf3c2
> > [   59.876089] Call Trace:
> > [   59.876092] ([<002bf234>] __split_huge_pmd_locked+0x3ec/0xa10)
> > [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> > [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> > [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> > [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> > [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> > [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> > [   59.876113] INFO: lockdep is turned off.
> > [   59.876115] Last Breaking-Event-Address:
> > [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10
> 
> The BUG at mm/huge_memory.c:2884 is interesting, it's the 
> BUG_ON(!pte_none(*pte))
> check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
> pagetables to be empty, but in collapse_huge_page() we deposit the original
> pagetable instead of allocating a new (empty) one. This saves an allocation,
> which is good, but doesn't that mean that if such a collapsed hugepage will
> ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
> of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?
> 
> This behavior is not new, it was the same before the THP rework, so I do not
> assume that it is related to the current problems, maybe with the exception
> of this specific crash. I never saw the BUG at mm/huge_memory.c:2884 myself,
> and the other crashes probably cannot be explained with this. Maybe I am
> also missing something, but I do not see how collapse_huge_page() and the
> (non-empty) pgtable deposit there can work out with the 
> BUG_ON(!pte_none(*pte))
> checks. Any thoughts?

I don't think there's a problem: ptes in the pgtable are cleared with
pte_clear() in __collapse_huge_page_copy().
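
For context, a simplified sketch of the loop shape in
__collapse_huge_page_copy() (locking and the pte_none() case omitted):

	/* every small pte is copied and then cleared, so the page table
	 * that gets deposited is empty by the time the collapsed huge
	 * page could ever be split again */
	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, address += PAGE_SIZE, page++) {
		struct page *src_page = pte_page(*_pte);

		copy_user_highpage(page, src_page, address, vma);
		pte_clear(vma->vm_mm, address, _pte);
		page_remove_rmap(src_page, false);
		free_page_and_swap_cache(src_page);
	}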

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-18 Thread Kirill A. Shutemov
On Thu, Feb 18, 2016 at 04:00:37PM +0100, Gerald Schaefer wrote:
> On Thu, 18 Feb 2016 01:58:08 +0200
> "Kirill A. Shutemov"  wrote:
> 
> > On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> > > On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
> > > Sebastian Ott  wrote:
> > > 
> > > > [   59.875935] [ cut here ]
> > > > [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> > > > [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP 
> > > > DEBUG_PAGEALLOC
> > > > [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en 
> > > > vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad 
> > > > ib_core ib_addr ghash_s390 prng raid6_pq ecb aes_s390 des_s390 
> > > > des_generic sha512_s390 sha256_s390 sha1_s390 mlx4_core sha_common 
> > > > genwqe_card scm_block crc_itu_t vhost_net tun vhost dm_mod macvtap 
> > > > eadm_sch macvlan kvm autofs4
> > > > [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> > > > 4.4.0-07794-ga4eff16-dirty #77
> > > > [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> > > > cfecc000
> > > > [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> > > > (__split_huge_pmd_locked+0x562/0xa10)
> > > > [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > > PM:0 EA:3
> > > >Krnl GPRS: 01a7a1cf 03d10177c000 
> > > > 00044068 5df00215
> > > > [   59.876051]0001 0001 
> > > >  774e6900
> > > > [   59.876054]03ff5200 6d403b10 
> > > > 6e1eb800 03ff51f0
> > > > [   59.876058]03d10177c000 00715190 
> > > > 002bf234 cfecfb58
> > > > [   59.876068] Krnl Code: 002bf39c: d507d010a000clc 
> > > > 16(8,%%r13),0(%%r10)
> > > >   002bf3a2: a7840004brc 
> > > > 8,2bf3aa
> > > >  #002bf3a6: a7f40001brc 
> > > > 15,2bf3a8
> > > >  >002bf3aa: 91407440tm  
> > > > 1088(%%r7),64
> > > >   002bf3ae: a7840208brc 
> > > > 8,2bf7be
> > > >   002bf3b2: a7f401e9brc 
> > > > 15,2bf784
> > > >   002bf3b6: 9104a006tm  
> > > > 6(%%r10),4
> > > >   002bf3ba: a7740004brc 
> > > > 7,2bf3c2
> > > > [   59.876089] Call Trace:
> > > > [   59.876092] ([<002bf234>] 
> > > > __split_huge_pmd_locked+0x3ec/0xa10)
> > > > [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> > > > [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> > > > [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> > > > [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> > > > [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> > > > [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> > > > [   59.876113] INFO: lockdep is turned off.
> > > > [   59.876115] Last Breaking-Event-Address:
> > > > [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10
> > > 
> > > The BUG at mm/huge_memory.c:2884 is interesting, it's the 
> > > BUG_ON(!pte_none(*pte))
> > > check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
> > > pagetables to be empty, but in collapse_huge_page() we deposit the 
> > > original
> > > pagetable instead of allocating a new (empty) one. This saves an 
> > > allocation,
> > > which is good, but doesn't that mean that if such a collapsed hugepage 
> > > will
> > > ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
> > > of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?
> > > 
> > > This behavior is not new, it was the same before the THP rework, so I do 
> > > not
> > > assume that it is related to the current problems, may

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-23 Thread Kirill A. Shutemov
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger  wrote:
> 
> > > I'm also confused that pmd_none() is equal to !pmd_present() on s390. Hm?
> > 
> > Don't know, Gerald or Martin?
> 
> The implementation frequently changes depending on how many new bits Martin
> needs to squeeze out :-)
> We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if 
> the
> entry is not empty. pmd_none() of course does the opposite, it checks if it is
> empty.

I still worry about pmd_present(). It looks wrong to me. I wonder if the
patch below makes a difference.

The theory is that the splitting bit effectively masked the bogus pmd_present():
we had pmd_trans_splitting() in all code paths and that prevented mm from
touching the pmd. Once pmd_trans_splitting() was gone, mm proceeded with the
pmd where it shouldn't, and there's the boom.

I'm not sure that the patch is correct wrt young/old pmds and I have no
way to test it...

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 64ead8091248..2eeb17ab68ac 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -490,7 +490,7 @@ static inline int pud_bad(pud_t pud)
 
 static inline int pmd_present(pmd_t pmd)
 {
-   return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
+   return !(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID);
 }
 
 static inline int pmd_none(pmd_t pmd)
-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Question on follow_page_mask

2016-02-23 Thread Kirill A. Shutemov
On Tue, Feb 23, 2016 at 06:45:05PM +0530, Anshuman Khandual wrote:
> Not able to understand the first code block of follow_page_mask
> function. follow_huge_addr function is expected to find the page
> struct for the given address if it turns out to be a HugeTLB page
> but then when it finds the page we bug on if it had been called
> with FOLL_GET flag.
> 
>   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
>   if (!IS_ERR(page)) {
>   BUG_ON(flags & FOLL_GET);
>   return page;
>   }
> 
> do_move_page_to_node_array calls follow_page with FOLL_GET which
> in turn calls follow_page_mask with FOLL_GET. On POWER, the
> function follow_huge_addr is defined and does not return -EINVAL
> like the generic one. It returns the page struct if its a HugeTLB
> page. Just curious to know what is the purpose behind the BUG_ON.

I would guess that requesting a pin on a non-reclaimable page is considered
useless, i.e. suspicious behavior. BUG_ON() is overkill, I think;
WARN_ON_ONCE() would do.

Note that this follow_huge_addr() on Power is not reachable via
do_move_page_to_node_array() anyway, because the vma is !vma_is_migratable().
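
A sketch of how the check could be relaxed along those lines (illustrative,
not a tested patch):

	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
	if (!IS_ERR(page)) {
		/* pinning a non-reclaimable page is suspicious, but not fatal */
		if (WARN_ON_ONCE(flags & FOLL_GET))
			return ERR_PTR(-EFAULT);
		return page;
	}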

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-23 Thread Kirill A. Shutemov
On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> I'll check with Martin, maybe it is actually trivial, then we can
> do a quick test it to rule that one out.

Oh. I found a bug in __split_huge_pmd_locked(). Although I'm not sure if it's
_the_ bug.

pmdp_invalidate() is called for the wrong address :-/
I guess that can be destructive on the architecture, right?

Could you check this?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c317b85ea7d..4246bc70e55a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
 
-   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   for (i = 0; i < HPAGE_PMD_NR; i++) {
pte_t entry, *pte;
/*
 * Note that NUMA hinting access restrictions are not
@@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
}
if (dirty)
SetPageDirty(page + i);
-   pte = pte_offset_map(&_pmd, haddr);
+   pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
BUG_ON(!pte_none(*pte));
-   set_pte_at(mm, haddr, pte, entry);
+   set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
atomic_inc(&page[i]._mapcount);
pte_unmap(pte);
}
@@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
pmd_populate(mm, pmd, pgtable);
 
if (freeze) {
-   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   for (i = 0; i < HPAGE_PMD_NR; i++) {
page_remove_rmap(page + i, false);
    put_page(page + i);
}
-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-25 Thread Kirill A. Shutemov
On Thu, Feb 25, 2016 at 03:49:33PM +, Steve Capper wrote:
> On 23 February 2016 at 18:47, Will Deacon  wrote:
> > [adding Steve, since he worked on THP for 32-bit ARM]
> 
> Apologies for my late reply...
> 
> >
> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> >> On Tue, 23 Feb 2016 13:32:21 +0300
> >> "Kirill A. Shutemov"  wrote:
> >> > The theory is that the splitting bit effectively masked the bogus pmd_present():
> >> > we had pmd_trans_splitting() in all code paths and that prevented mm from
> >> > touching the pmd. Once pmd_trans_splitting() was gone, mm proceeded with 
> >> > the
> >> > pmd where it shouldn't, and there's the boom.
> >>
> >> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd 
> >> under
> >> splitting, after all there is a page behind the the pmd. Also, if it was
> >> bogus, and it would need to be false, why should it be marked 
> >> !pmd_present()
> >> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
> >> is pmd_present() before that, on all architectures, and if there was any
> >> problem/race with that, setting it to !pmd_present() at this stage would
> >> only (marginally) reduce the race window.
> >>
> >> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
> >> i.e. they do not set pmd_present() == false, only mark it so that it would
> >> not generate a new TLB entry, just like on s390. After all, the function
> >> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
> >> before that call is just a little ambiguous in its wording. When it says
> >> "mark the pmd notpresent" it probably means "mark it so that it will not
> >> generate a new TLB entry", which is also what the comment is really about:
> >> prevent huge and small entries in the TLB for the same page at the same
> >> time.
> >>
> >> FWIW, and since the ARM arch-list is already on cc, I think there is
> >> an issue with pmdp_invalidate() on ARM, since it also seems to clear
> >> the trans_huge (and formerly trans_splitting) bit, which actually makes
> >> the pmd !pmd_present(), but it violates the other requirement from the
> >> comment:
> >> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
> >> on the pmd until the split is complete for this pmd"
> >
> > I've only been testing this for arm64 (where I'm yet to see a problem),
> > but we use the generic pmdp_invalidate implementation from
> > mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
> > after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
> > the entire entry... Steve?
> 
> pmd_mknotpresent on arm looks inconsistent with the other
> architectures and can be changed.
> 
> Having had a look at the usage, I can't see it causing an immediate
> problem (that needs to be addressed by an emergency patch).
> We don't have a notion of splitting pmds (so there is no splitting
> information to lose), and the only usage I could see of
> pmd_mknotpresent was:
> 
> pmdp_invalidate(vma, haddr, pmd);
> pmd_populate(mm, pmd, pgtable);
> 
> In mm/huge_memory.c, around line 3588.
> 
> So we invalidate the entry (which puts down a faulting entry from
> pmd_mknotpresent and invalidates tlb), then immediately put down a
> table entry with pmd_populate.
> 
> I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
> what took me time), and I didn't notice any problems with 4.5-rc5.

If I read the code correctly, your pmd_mknotpresent() makes the pmd
pmd_none(), right? If so, it's a problem.

It introduces race I've described here:

https://marc.info/?l=linux-mm&m=144723658100512&w=4

Basically, if zap_pmd_range() sees pmd_none() between
pmd_mknotpresent() and pmd_populate(), we're screwed.

The race window is small, but it's there.
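
To spell it out, a schematic interleaving (not actual code):

/*
 *   CPU A: __split_huge_pmd_locked()      CPU B: zap_pmd_range()
 *
 *   pmdp_invalidate(vma, haddr, pmd);
 *     on 32-bit ARM pmd_mknotpresent()
 *     zeroes the entry, so *pmd now
 *     looks pmd_none()
 *                                          sees pmd_none(*pmd) and skips
 *                                          the range, although pages are
 *                                          still mapped via the page table
 *                                          installed just below
 *   pmd_populate(mm, pmd, pgtable);
 */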

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Kirill A. Shutemov
On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote:
> Some architecture would like to be triggered when a memory area is moved
> through the mremap system call.
> 
> This patch is introducing a new arch_remap mm hook which is placed in the
> path of mremap, and is called before the old area is unmapped (and the
> arch_unmap hook is called).
> 
> The architectures which need to call this hook should define
> __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the arch_remap
> service with the following prototype:
> void arch_remap(struct mm_struct *mm,
> unsigned long old_start, unsigned long old_end,
> unsigned long new_start, unsigned long new_end);
> 
> Signed-off-by: Laurent Dufour 
> Reviewed-by: Ingo Molnar 
> ---
>  mm/mremap.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 2dc44b1cb1df..009db5565893 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -25,6 +25,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -286,13 +287,19 @@ static unsigned long move_vma(struct vm_area_struct 
> *vma,
>   old_len = new_len;
>   old_addr = new_addr;
>   new_addr = -ENOMEM;
> - } else if (vma->vm_file && vma->vm_file->f_op->mremap) {
> - err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> - if (err < 0) {
> - move_page_tables(new_vma, new_addr, vma, old_addr,
> -  moved_len, true);
> - return err;
> + } else {
> + if (vma->vm_file && vma->vm_file->f_op->mremap) {
> + err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> + if (err < 0) {
> + move_page_tables(new_vma, new_addr, vma,
> +   old_addr, moved_len, true);
> + return err;
> + }
>   }
> +#ifdef __HAVE_ARCH_REMAP

It would be cleaner to provide dummy arch_remap() for !__HAVE_ARCH_REMAP
in some generic header.


> + arch_remap(mm, old_addr, old_addr+old_len,
> +new_addr, new_addr+new_len);

Spaces around '+'?

> +#endif
>   }
>  
>   /* Conceal VM_ACCOUNT so old reservation is not undone */
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Kirill A. Shutemov
On Mon, Apr 13, 2015 at 02:41:22PM +0200, Laurent Dufour wrote:
> On 13/04/2015 13:58, Kirill A. Shutemov wrote:
> > On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote:
> >> Some architecture would like to be triggered when a memory area is moved
> >> through the mremap system call.
> >>
> >> This patch is introducing a new arch_remap mm hook which is placed in the
> >> path of mremap, and is called before the old area is unmapped (and the
> >> arch_unmap hook is called).
> >>
> >> The architectures which need to call this hook should define
> >> __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the arch_remap
> >> service with the following prototype:
> >> void arch_remap(struct mm_struct *mm,
> >> unsigned long old_start, unsigned long old_end,
> >> unsigned long new_start, unsigned long new_end);
> >>
> >> Signed-off-by: Laurent Dufour 
> >> Reviewed-by: Ingo Molnar 
> >> ---
> >>  mm/mremap.c | 19 +--
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/mm/mremap.c b/mm/mremap.c
> >> index 2dc44b1cb1df..009db5565893 100644
> >> --- a/mm/mremap.c
> >> +++ b/mm/mremap.c
> >> @@ -25,6 +25,7 @@
> >>  
> >>  #include 
> >>  #include 
> >> +#include 
> >>  
> >>  #include "internal.h"
> >>  
> >> @@ -286,13 +287,19 @@ static unsigned long move_vma(struct vm_area_struct 
> >> *vma,
> >>old_len = new_len;
> >>old_addr = new_addr;
> >>new_addr = -ENOMEM;
> >> -  } else if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >> -  err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> >> -  if (err < 0) {
> >> -  move_page_tables(new_vma, new_addr, vma, old_addr,
> >> -   moved_len, true);
> >> -  return err;
> >> +  } else {
> >> +  if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >> +  err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> >> +  if (err < 0) {
> >> +  move_page_tables(new_vma, new_addr, vma,
> >> +old_addr, moved_len, true);
> >> +  return err;
> >> +  }
> >>}
> >> +#ifdef __HAVE_ARCH_REMAP
> > 
> > It would be cleaner to provide dummy arch_remap() for !__HAVE_ARCH_REMAP
> > in some generic header.
> 
> The idea was to not impact all the architectures as arch_unmap(),
> arch_dup_mmap() or arch_exit_mmap() implies.
> 
> I look at the headers where such a dummy arch_remap could be put but I
> can't figure out one which will not impact all the architecture.
> What about defining a dummy service earlier in mm/remap.c in the case
> __HAVE_ARCH_REMAP is not defined ?
> Something like :
> #ifndef __HAVE_ARCH_REMAP
> static inline void void arch_remap(struct mm_struct *mm,
>unsigned long old_start,
>unsigned long old_end,
>unsigned long new_start,
>unsigned long new_end)
> {
> }
> #endif

Or just #define arch_remap(...) do { } while (0)
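
That is, something like this next to the hook declaration in a generic header
(sketch only):

#ifndef __HAVE_ARCH_REMAP
#define arch_remap(mm, old_start, old_end, new_start, new_end) \
	do { } while (0)
#endif

Then move_vma() can call arch_remap() unconditionally, with no #ifdef at the
call site.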

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Kirill A. Shutemov
On Mon, Apr 13, 2015 at 04:35:21PM +0300, Pavel Emelyanov wrote:
> On 04/13/2015 04:21 PM, Laurent Dufour wrote:
> > On 13/04/2015 15:13, Kirill A. Shutemov wrote:
> >> On Mon, Apr 13, 2015 at 02:41:22PM +0200, Laurent Dufour wrote:
> >>> On 13/04/2015 13:58, Kirill A. Shutemov wrote:
> >>>> On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote:
> >>>>> Some architecture would like to be triggered when a memory area is moved
> >>>>> through the mremap system call.
> >>>>>
> >>>>> This patch is introducing a new arch_remap mm hook which is placed in 
> >>>>> the
> >>>>> path of mremap, and is called before the old area is unmapped (and the
> >>>>> arch_unmap hook is called).
> >>>>>
> >>>>> The architectures which need to call this hook should define
> >>>>> __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the arch_remap
> >>>>> service with the following prototype:
> >>>>> void arch_remap(struct mm_struct *mm,
> >>>>> unsigned long old_start, unsigned long old_end,
> >>>>> unsigned long new_start, unsigned long new_end);
> >>>>>
> >>>>> Signed-off-by: Laurent Dufour 
> >>>>> Reviewed-by: Ingo Molnar 
> >>>>> ---
> >>>>>  mm/mremap.c | 19 +--
> >>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/mremap.c b/mm/mremap.c
> >>>>> index 2dc44b1cb1df..009db5565893 100644
> >>>>> --- a/mm/mremap.c
> >>>>> +++ b/mm/mremap.c
> >>>>> @@ -25,6 +25,7 @@
> >>>>>  
> >>>>>  #include 
> >>>>>  #include 
> >>>>> +#include 
> >>>>>  
> >>>>>  #include "internal.h"
> >>>>>  
> >>>>> @@ -286,13 +287,19 @@ static unsigned long move_vma(struct 
> >>>>> vm_area_struct *vma,
> >>>>> old_len = new_len;
> >>>>> old_addr = new_addr;
> >>>>> new_addr = -ENOMEM;
> >>>>> -   } else if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >>>>> -   err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> >>>>> -   if (err < 0) {
> >>>>> -   move_page_tables(new_vma, new_addr, vma, 
> >>>>> old_addr,
> >>>>> -moved_len, true);
> >>>>> -   return err;
> >>>>> +   } else {
> >>>>> +   if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >>>>> +   err = vma->vm_file->f_op->mremap(vma->vm_file, 
> >>>>> new_vma);
> >>>>> +   if (err < 0) {
> >>>>> +   move_page_tables(new_vma, new_addr, vma,
> >>>>> + old_addr, moved_len, 
> >>>>> true);
> >>>>> +   return err;
> >>>>> +   }
> >>>>> }
> >>>>> +#ifdef __HAVE_ARCH_REMAP
> >>>>
> >>>> It would be cleaner to provide dummy arch_remap() for !__HAVE_ARCH_REMAP
> >>>> in some generic header.
> >>>
> >>> The idea was to not impact all the architectures as arch_unmap(),
> >>> arch_dup_mmap() or arch_exit_mmap() implies.
> >>>
> >>> I look at the headers where such a dummy arch_remap could be put but I
> >>> can't figure out one which will not impact all the architecture.
> >>> What about defining a dummy service earlier in mm/remap.c in the case
> >>> __HAVE_ARCH_REMAP is not defined ?
> >>> Something like :
> >>> #ifndef __HAVE_ARCH_REMAP
> >>> static inline void void arch_remap(struct mm_struct *mm,
> >>>unsigned long old_start,
> >>>unsigned long old_end,
> >>>unsigned long new_start,
> >>>unsigned long new_end)
> >>> {
> >>> }
> >>> #endif
> >>
> >> Or just #define arch_remap(...) do { } while (0)
> >>
> > 
> > I guessed you wanted the arch_remap() prototype to be exposed somewhere
> > in the code.
> > 
> > To be honest, I can't find the benefit of defining a dummy arch_remap()
> > in mm/remap.c if __HAVE_ARCH_REMAP is not defined instead of calling it
> > in move_vma if __HAVE_ARCH_REMAP is defined.
> > Is it really what you want ?
> 
> I think Kirill meant something like e.g. the arch_enter_lazy_mmu_mode()
> is implemented and called in mm/mremap.c -- the "generic" part is in the
> include/asm-generic/pgtable.h and those architectures willing to have
> their own implementation are in arch/$arch/...
> 
> Kirill, if I'm right with it, can you suggest the header where to put
> the "generic" mremap hook's (empty) body?

I initially thought it would be enough to put it into
, expecting it works as
. But that's not the case.

It probably worth at some point rework all  to include
 at the end as we do for .
But that's outside the scope of the patchset, I guess.

I don't see any better candidate for such dummy header. :-/

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Kirill A. Shutemov
On Mon, Apr 13, 2015 at 04:11:19PM +0200, Laurent Dufour wrote:
> On 13/04/2015 16:02, Kirill A. Shutemov wrote:
> > On Mon, Apr 13, 2015 at 04:35:21PM +0300, Pavel Emelyanov wrote:
> >> On 04/13/2015 04:21 PM, Laurent Dufour wrote:
> >>> On 13/04/2015 15:13, Kirill A. Shutemov wrote:
> >>>> On Mon, Apr 13, 2015 at 02:41:22PM +0200, Laurent Dufour wrote:
> >>>>> On 13/04/2015 13:58, Kirill A. Shutemov wrote:
> >>>>>> On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote:
> >>>>>>> Some architecture would like to be triggered when a memory area is 
> >>>>>>> moved
> >>>>>>> through the mremap system call.
> >>>>>>>
> >>>>>>> This patch is introducing a new arch_remap mm hook which is placed in 
> >>>>>>> the
> >>>>>>> path of mremap, and is called before the old area is unmapped (and the
> >>>>>>> arch_unmap hook is called).
> >>>>>>>
> >>>>>>> The architectures which need to call this hook should define
> >>>>>>> __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the 
> >>>>>>> arch_remap
> >>>>>>> service with the following prototype:
> >>>>>>> void arch_remap(struct mm_struct *mm,
> >>>>>>> unsigned long old_start, unsigned long old_end,
> >>>>>>> unsigned long new_start, unsigned long new_end);
> >>>>>>>
> >>>>>>> Signed-off-by: Laurent Dufour 
> >>>>>>> Reviewed-by: Ingo Molnar 
> >>>>>>> ---
> >>>>>>>  mm/mremap.c | 19 +--
> >>>>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/mremap.c b/mm/mremap.c
> >>>>>>> index 2dc44b1cb1df..009db5565893 100644
> >>>>>>> --- a/mm/mremap.c
> >>>>>>> +++ b/mm/mremap.c
> >>>>>>> @@ -25,6 +25,7 @@
> >>>>>>>  
> >>>>>>>  #include 
> >>>>>>>  #include 
> >>>>>>> +#include 
> >>>>>>>  
> >>>>>>>  #include "internal.h"
> >>>>>>>  
> >>>>>>> @@ -286,13 +287,19 @@ static unsigned long move_vma(struct 
> >>>>>>> vm_area_struct *vma,
> >>>>>>>   old_len = new_len;
> >>>>>>>   old_addr = new_addr;
> >>>>>>>   new_addr = -ENOMEM;
> >>>>>>> - } else if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >>>>>>> - err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> >>>>>>> - if (err < 0) {
> >>>>>>> - move_page_tables(new_vma, new_addr, vma, 
> >>>>>>> old_addr,
> >>>>>>> -  moved_len, true);
> >>>>>>> - return err;
> >>>>>>> + } else {
> >>>>>>> + if (vma->vm_file && vma->vm_file->f_op->mremap) {
> >>>>>>> + err = vma->vm_file->f_op->mremap(vma->vm_file, 
> >>>>>>> new_vma);
> >>>>>>> + if (err < 0) {
> >>>>>>> + move_page_tables(new_vma, new_addr, vma,
> >>>>>>> +   old_addr, moved_len, 
> >>>>>>> true);
> >>>>>>> + return err;
> >>>>>>> + }
> >>>>>>>   }
> >>>>>>> +#ifdef __HAVE_ARCH_REMAP
> >>>>>>
> >>>>>> It would be cleaner to provide dummy arch_remap() for 
> >>>>>> !__HAVE_ARCH_REMAP
> >>>>>> in some generic header.
> >>>>>
> >>>>> The idea was to not impact all the architectures as arch_unmap(),
> >>>>> arch_dup_mmap() or arch_exit_mmap() implies.

Re: [RFC PATCH] mm/thp: Use new function to clear pmd before THP splitting

2015-05-05 Thread Kirill A. Shutemov
On Mon, May 04, 2015 at 10:59:16PM +0530, Aneesh Kumar K.V wrote:
> Archs like ppc64 require pte_t * to remain stable in some code path.
> They use local_irq_disable to prevent a parallel split. Generic code
> clear pmd instead of marking it _PAGE_SPLITTING in code path
> where we can afford to mark pmd none before splitting. Use a
> variant of pmdp_splitting_clear_notify that arch can override.
> 
> Signed-off-by: Aneesh Kumar K.V 

Sorry, I'm still trying to wrap my head around this problem.

So, Power has __find_linux_pte_or_hugepte() which does a lock-less lookup in
page tables with local interrupts disabled. For huge pages it casts pmd_t
to pte_t. Since the format of pte_t is different from pmd_t, we want to prevent
a transition from a pmd pointing to a page table to a pmd pointing to a huge
page (and back) while interrupts are disabled.

The complication for Power is that it doesn't do an implicit IPI on TLB
flush.

Is it correct?

For THP, the split_huge_page() and collapse sides are covered. This patch
should address the two cases of splitting the PMD, but not splitting of the
compound page, in current upstream.

But I think there's still a *big* problem for Power -- zap_huge_pmd().

For instance: another CPU can shoot down a THP PMD with MADV_DONTNEED and
fault in small pages instead. IIUC, for __find_linux_pte_or_hugepte(),
that's equivalent to splitting.

I don't see how this can be fixed without kick_all_cpus_sync() in all
pmdp_clear_flush() on Power.
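
For clarity, the serialization pattern being discussed, as a rough sketch
(walk_page_tables() is a made-up placeholder for the lockless walker):

	/* Reader side (e.g. __find_linux_pte_or_hugepte / gup_fast):
	 * disabling interrupts is what pins its view of the page tables. */
	local_irq_save(flags);
	ptep = walk_page_tables(mm, addr);	/* placeholder */
	/* ... dereference ptep, possibly treating a huge pmd as a pte ... */
	local_irq_restore(flags);

	/* Writer side: after clearing the pmd, wait until every CPU has left
	 * its interrupts-disabled section before reusing the entry. */
	old_pmd = pmdp_clear_flush(vma, addr, pmdp);
	kick_all_cpus_sync();		/* IPIs all CPUs and waits for them */
	/* only now is it safe to install a regular page table here */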

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions

2015-05-07 Thread Kirill A. Shutemov
ng new HPTE entries for normal subpages.
> +  * That means we could be modifying the page content as we
> +  * copy them to a huge page. So wait for parallel hash_page
> +  * to finish before invalidating HPTE entries. We can do this
> +  * by sending an IPI to all the cpus and executing a dummy
> +  * function there.
> +  */
> + kick_all_cpus_sync();
> + /*
> +  * Now invalidate the hpte entries in the range
> +  * covered by pmd. This make sure we take a
> +  * fault and will find the pmd as none, which will
> +  * result in a major fault which takes mmap_sem and
> +  * hence wait for collapse to complete. Without this
> +  * the __collapse_huge_page_copy can result in copying
> +  * the old content.
> +  */
> + flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
>   return pmd;
>  }
>  
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 39f1d6a2b04d..80e6d415cd57 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -189,6 +189,25 @@ extern void pmdp_splitting_flush(struct vm_area_struct 
> *vma,
>unsigned long address, pmd_t *pmdp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +unsigned long address,
> +pmd_t *pmdp)
> +{
> + return pmdp_clear_flush(vma, address, pmdp);
> +}
> +#else
> +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +unsigned long address,
> +pmd_t *pmdp)
> +{
> + BUILD_BUG();
> + return __pmd(0);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#endif
> +
>  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>  pgtable_t pgtable);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 078832cf3636..88f695a4e38b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2499,7 +2499,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>* huge and small TLB entries for the same virtual address
>* to avoid the risk of CPU bugs in that area.
>*/
> - _pmd = pmdp_clear_flush(vma, address, pmd);
> +     _pmd = pmdp_collapse_flush(vma, address, pmd);

Why? pmdp_clear_flush() does kick_all_cpus_sync() already.

>   spin_unlock(pmd_ptl);
>   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V3] powerpc/thp: Serialize pmd clear against a linux page table walk.

2015-05-11 Thread Kirill A. Shutemov
On Mon, May 11, 2015 at 11:56:01AM +0530, Aneesh Kumar K.V wrote:
> Serialize against find_linux_pte_or_hugepte which does lock-less
> lookup in page tables with local interrupts disabled. For huge pages
> it casts pmd_t to pte_t. Since format of pte_t is different from
> pmd_t we want to prevent transit from pmd pointing to page table
> to pmd pointing to huge page (and back) while interrupts are disabled.
> We clear pmd to possibly replace it with page table pointer in
> different code paths. So make sure we wait for the parallel
> find_linux_pte_or_hugepage to finish.
> 
> Without this patch, a find_linux_pte_or_hugepte running in parallel to
> __split_huge_zero_page_pmd or do_huge_pmd_wp_page_fallback or zap_huge_pmd
> can run into the above issue. With __split_huge_zero_page_pmd and
> do_huge_pmd_wp_page_fallback we clear the hugepage pte before inserting
> the pmd entry with a regular pgtable address. Such a clear need to
> wait for the parallel find_linux_pte_or_hugepte to finish.
> 
> With zap_huge_pmd, we can run into issues, with a hugepage pte
> getting zapped due to a MADV_DONTNEED while other cpu fault it
> in as small pages.
> 
> Reported-by: Kirill A. Shutemov 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Kirill A. Shutemov 

CC: stable@ ?

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V4 1/3] mm/thp: Split out pmd collpase flush into a separate functions

2015-05-12 Thread Kirill A. Shutemov
On Tue, May 12, 2015 at 11:38:32AM +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64 [1] need to do special things while clearing
> pmd before a collapse. For them this operation is largely different
> from a normal hugepage pte clear. Hence add a separate function
> to clear pmd before collapse. After this patch pmdp_* functions
> operate only on hugepage pte, and not on regular pmd_t values
> pointing to page table.
> 
> [1] ppc64 needs to invalidate all the normal page pte mappings we
> already have inserted in the hardware hash page table. But before
> doing that we need to make sure there are no parallel hash page
> table insert going on. So we need to do a kick_all_cpus_sync()
> before flushing the older hash table entries. By moving this to
> a separate function we capture these details and mention how it
> is different from a hugepage pte clear.
> 
> This patch is a cleanup and only does code movement for clarity.
> There should not be any change in functionality.
> 
> Signed-off-by: Aneesh Kumar K.V 

For the patchset:

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-22 Thread Kirill A. Shutemov
On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings.  If only portions of the mapping will be
> used this can incur a high penalty for locking.
> 
> Now that we have the new VMA flag for the locked but not present state,
> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.

What is the advantage over mmap() + mlock(MLOCK_ONFAULT)?
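
That is, from userspace (sketch; it assumes an mlock2() wrapper -- at the time
of this series it would be syscall(__NR_mlock2, ...)):

#include <stddef.h>
#include <sys/mman.h>

static void *map_lock_on_fault(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (mlock2(p, len, MLOCK_ONFAULT)) {	/* lock pages as they fault in */
		munmap(p, len);
		return NULL;
	}
	return p;
}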

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-22 Thread Kirill A. Shutemov
On Wed, Jul 22, 2015 at 10:32:20AM -0400, Eric B Munson wrote:
> On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:
> 
> > On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
> > > The cost of faulting in all memory to be locked can be very high when
> > > working with large mappings.  If only portions of the mapping will be
> > > used this can incur a high penalty for locking.
> > > 
> > > Now that we have the new VMA flag for the locked but not present state,
> > > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> > 
> > What is advantage over mmap() + mlock(MLOCK_ONFAULT)?
> 
> There isn't one, it was added to maintain parity with the
> mlock(MLOCK_LOCK) -> mmap(MAP_LOCKED) set.  I think not having will lead
> to confusion because we have MAP_LOCKED so why don't we support
> LOCKONFAULT from mmap as well.

I don't think it's a good idea to spend bits in flags unless we have a
reason for that.

BTW, you have typo on sparc: s/0x8000/0x8/.


-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V5 2/7] mm: mlock: Add new mlock system call

2015-07-26 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:40PM -0400, Eric B Munson wrote:
> With the refactored mlock code, introduce a new system call for mlock.
> The new call will allow the user to specify what lock states are being
> added.  mlock2 is trivial at the moment, but a follow on patch will add
> a new mlock state making it useful.
> 
> Signed-off-by: Eric B Munson 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Heiko Carstens 
> Cc: Geert Uytterhoeven 
> Cc: Catalin Marinas 
> Cc: Stephen Rothwell 
> Cc: Guenter Roeck 
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: adi-buildroot-de...@lists.sourceforge.net
> Cc: linux-cris-ker...@axis.com
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-am33-l...@redhat.com
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
> Changes from V4:
> * Drop all architectures except x86[_64] from this patch, MIPS is added
>   later in the series.  All others will be left to their maintainers.
> 
> Changes from V3:
> * Do a (hopefully) complete job of adding the new system calls
>  arch/alpha/include/uapi/asm/mman.h | 2 ++
>  arch/mips/include/uapi/asm/mman.h  | 5 +
>  arch/parisc/include/uapi/asm/mman.h| 2 ++
>  arch/powerpc/include/uapi/asm/mman.h   | 2 ++
>  arch/sparc/include/uapi/asm/mman.h | 2 ++
>  arch/tile/include/uapi/asm/mman.h  | 5 +
>  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  arch/xtensa/include/uapi/asm/mman.h| 5 +

Define MLOCK_LOCKED in include/uapi/asm-generic/mman-common.h.
This way you can drop changes in powerpc, sparc and tile.
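
That is, something along these lines in the common header (sketch; the values
are hypothetical and would have to avoid clashing with the per-arch ones):

/* include/uapi/asm-generic/mman-common.h */
#define MLOCK_LOCKED	0x01	/* lock and populate the range now */
#define MLOCK_ONFAULT	0x02	/* lock pages as they are faulted in */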

Otherwise looks good.

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V5 4/7] mm: mlock: Add mlock flags to enable VM_LOCKONFAULT usage

2015-07-27 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:42PM -0400, Eric B Munson wrote:
> The previous patch introduced a flag that specified pages in a VMA
> should be placed on the unevictable LRU, but they should not be made
> present when the area is created.  This patch adds the ability to set
> this state via the new mlock system calls.
> 
> We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
> MLOCK_ONFAULT will set the VM_LOCKONFAULT flag as well as the VM_LOCKED
> flag for the target region.  MCL_CURRENT and MCL_ONFAULT are used to
> lock current mappings.  With MCL_CURRENT all pages are made present and
> with MCL_ONFAULT they are locked when faulted in.  When specified with
> MCL_FUTURE all new mappings will be marked with VM_LOCKONFAULT.
> 
> Currently, mlockall() clears all VMA lock flags and then sets the
> requested flags.  For instance, if a process has MCL_FUTURE and
> MCL_CURRENT set, but they want to clear MCL_FUTURE this would be
> accomplished by calling mlockall(MCL_CURRENT).  This still holds with
> the introduction of MCL_ONFAULT.  Each call to mlockall() resets all
> VMA flags to the values specified in the current call.  The new mlock2
> system call behaves in the same way.  If a region is locked with
> MLOCK_ONFAULT and a user wants to force it to be populated now, a second
> call to mlock2(MLOCK_LOCKED) will accomplish this.
> 
> munlock() will unconditionally clear both vma flags.  munlockall()
> unconditionally clears for VMA flags on all VMAs and in the
> mm->def_flags field.
> 
> Signed-off-by: Eric B Munson 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Jonathan Corbet 
> Cc: "Kirill A. Shutemov" 
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
> Changes from V4:
> * Split addition of VMA flag
> 
> Changes from V3:
> * Do extensive search for VM_LOCKED and ensure that VM_LOCKONFAULT is also 
> handled
>  where appropriate
>  arch/alpha/include/uapi/asm/mman.h   |  2 ++
>  arch/mips/include/uapi/asm/mman.h|  2 ++
>  arch/parisc/include/uapi/asm/mman.h  |  2 ++
>  arch/powerpc/include/uapi/asm/mman.h |  2 ++
>  arch/sparc/include/uapi/asm/mman.h   |  2 ++
>  arch/tile/include/uapi/asm/mman.h    |  3 +++
>  arch/xtensa/include/uapi/asm/mman.h  |  2 ++

Again, you can save a few lines by moving some code into mman-common.h.

Otherwise looks good.

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V5 5/7] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-27 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings.  If only portions of the mapping will be
> used this can incur a high penalty for locking.
> 
> Now that we have the new VMA flag for the locked but not present state,
> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.

As I mentioned before, I don't think this interface is justified.

MAP_LOCKED has known issues[1]. The MAP_LOCKED problem does not necessarily
affect MAP_LOCKONFAULT, but still.

Let's not add new interface unless it's demonstrably useful.

[1] http://lkml.kernel.org/g/20150114095019.gc4...@dhcp22.suse.cz

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V5 5/7] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-27 Thread Kirill A. Shutemov
On Mon, Jul 27, 2015 at 09:41:26AM -0400, Eric B Munson wrote:
> On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
> 
> > On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> > > The cost of faulting in all memory to be locked can be very high when
> > > working with large mappings.  If only portions of the mapping will be
> > > used this can incur a high penalty for locking.
> > > 
> > > Now that we have the new VMA flag for the locked but not present state,
> > > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> > 
> > As I mentioned before, I don't think this interface is justified.
> > 
> > MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
> > affects MAP_LOCKONFAULT, but still.
> > 
> > Let's not add new interface unless it's demonstrably useful.
> > 
> > [1] http://lkml.kernel.org/g/20150114095019.gc4...@dhcp22.suse.cz
> 
> I understand and should have been more explicit.  This patch is still
> included because I have an internal user that wants to see it added.
> The problem discussed in the thread you point out does not affect
> MAP_LOCKONFAULT because we do not attempt to populate the region with
> MAP_LOCKONFAULT.
> 
> As I told Vlastimil, if this is a hard NAK with the patch I can work
> with that.  Otherwise I prefer it stays.

That's not how it works.

Once an ABI is added to the kernel, it stays there practically forever.
Therefore it must be useful enough to justify the maintenance cost, and I don't
see that demonstrated.

So, NAK.

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc/mm: Fix Multi hit ERAT cause by recent THP update

2016-02-05 Thread Kirill A. Shutemov
 generic code that rely on IRQ disabling
> +  * to prevent a parallel THP split work as expected.
> +  */
> + kick_all_cpus_sync();
>  }
>  
>  /*
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 0b3c0d39ef75..388065c79795 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -239,6 +239,14 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t *pmdp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH
> +static inline void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +  unsigned long address, pmd_t *pmdp)
> +{
> + return;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTE_SAME
>  static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 36c070167b71..b52d16a86e91 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   young = pmd_young(*pmd);
>   dirty = pmd_dirty(*pmd);
>  
> + pmdp_huge_splitting_flush(vma, haddr, pmd);
>   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>   pmd_populate(mm, &_pmd, pgtable);
>  
> -- 
> 2.5.0
> 

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V2] powerpc/mm: Fix Multi hit ERAT cause by recent THP update

2016-02-07 Thread Kirill A. Shutemov
ess, pmdp, _PAGE_USER, 0);
> +}
> +
> +
>  /*
>   * set a new huge pmd. We should not be called for updating
>   * an existing pmd entry. That should go via pmd_hugepage_update.
> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long 
> addr,
>   return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>  }
>  
> +/*
> + * We use this to invalidate a pmdp entry before switching from a
> + * hugepte to regular pmd entry.
> + */
>  void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>pmd_t *pmdp)
>  {
> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> + /*
> +  * This ensures that generic code that rely on IRQ disabling
> +  * to prevent a parallel THP split work as expected.
> +  */
> + kick_all_cpus_sync();
>  }
>  
>  /*
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 0b3c0d39ef75..93a0937652ec 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -239,6 +239,14 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t *pmdp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH
> +static inline void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +  unsigned long address, pmd_t *pmdp)
> +{
> +
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTE_SAME
>  static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 36c070167b71..b52d16a86e91 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   young = pmd_young(*pmd);
>   dirty = pmd_dirty(*pmd);
>  
> + pmdp_huge_splitting_flush(vma, haddr, pmd);

Let's call it pmdp_huge_split_prepare().

"_flush" part is ppc-specific implementation detail and generic code
should not expect tlb to be flushed there.
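
That is, the generic no-op would look something like (sketch):

#ifndef __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
					   unsigned long address,
					   pmd_t *pmdp)
{
}
#endif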

Otherwise,

Acked-by: Kirill A. Shutemov 

>   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>   pmd_populate(mm, &_pmd, pgtable);
>  
> -- 
> 2.5.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> Hi,
> 
> Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> review of the THP rework patches, which cannot be bisected, revealed
> commit fecffad "s390, thp: remove infrastructure for handling splitting PMDs"
> (and also similar commits for other archs).
> 
> This commit removes the THP splitting bit and also the architecture
> implementation of pmdp_splitting_flush(), which took care of the IPI for
> fast_gup serialization. The commit message says
> 
> pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> needed for fast_gup
> 
> The assumption that a TLB flush will also produce an IPI is wrong on s390,
> and maybe also on other architectures, and I thought that this was actually
> the main reason for having an arch-specific pmdp_splitting_flush().
> 
> At least PowerPC and ARM also had an individual implementation of
> pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> flush to send the IPI, and those were also removed. Putting the arch
> maintainers and mailing lists on cc to verify.
> 
> On s390 this will break the IPI serialization against fast_gup, which
> would certainly explain the random kernel crashes, please revert or fix
> the pmdp_splitting_flush() removal.

Sorry for that.

I believe the problem was already addressed for PowerPC:

http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
the trick, right?

If so, I'll prepare a patch tomorrow (some sleep required).

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 09:09:42PM +0200, Kirill A. Shutemov wrote:
> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > Hi,
> > 
> > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > review of the THP rework patches, which cannot be bisected, revealed
> > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> > PMDs"
> > (and also similar commits for other archs).
> > 
> > This commit removes the THP splitting bit and also the architecture
> > implementation of pmdp_splitting_flush(), which took care of the IPI for
> > fast_gup serialization. The commit message says
> > 
> > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> > needed for fast_gup
> > 
> > The assumption that a TLB flush will also produce an IPI is wrong on s390,
> > and maybe also on other architectures, and I thought that this was actually
> > the main reason for having an arch-specific pmdp_splitting_flush().
> > 
> > At least PowerPC and ARM also had an individual implementation of
> > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > flush to send the IPI, and those were also removed. Putting the arch
> > maintainers and mailing lists on cc to verify.
> > 
> > On s390 this will break the IPI serialization against fast_gup, which
> > would certainly explain the random kernel crashes, please revert or fix
> > the pmdp_splitting_flush() removal.
> 
> Sorry for that.
> 
> I believe, the problem was already addressed for PowerPC:
> 
> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

Correct link is

http://lkml.kernel.org/g/1454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

-- 
 Kirill A. Shutemov
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> On Thu, 11 Feb 2016 21:09:42 +0200
> "Kirill A. Shutemov"  wrote:
> 
> > On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > > Hi,
> > > 
> > > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > > review of the THP rework patches, which cannot be bisected, revealed
> > > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> > > PMDs"
> > > (and also similar commits for other archs).
> > > 
> > > This commit removes the THP splitting bit and also the architecture
> > > implementation of pmdp_splitting_flush(), which took care of the IPI for
> > > fast_gup serialization. The commit message says
> > > 
> > > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> > > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> > > needed for fast_gup
> > > 
> > > The assumption that a TLB flush will also produce an IPI is wrong on s390,
> > > and maybe also on other architectures, and I thought that this was 
> > > actually
> > > the main reason for having an arch-specific pmdp_splitting_flush().
> > > 
> > > At least PowerPC and ARM also had an individual implementation of
> > > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > > flush to send the IPI, and those were also removed. Putting the arch
> > > maintainers and mailing lists on cc to verify.
> > > 
> > > On s390 this will break the IPI serialization against fast_gup, which
> > > would certainly explain the random kernel crashes, please revert or fix
> > > the pmdp_splitting_flush() removal.
> > 
> > Sorry for that.
> > 
> > I believe, the problem was already addressed for PowerPC:
> > 
> > http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> > 
> > I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> > the trick, right?
> 
> Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> fast_gup will still return false, because the pmd is not empty (at least
> on s390). So I don't see spontaneously how it will help fast_gup to break
> out to the slow path in case of THP splitting.

What does pmdp_flush_direct() do in pmdp_invalidate()? It's hard for me to
unwrap :-/
Does it make the pmd !pmd_present()?

I'm also confused: is pmd_none() equal to !pmd_present() on s390? Hm?
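
For reference, the bail-out being discussed sits at the top of the pmd step of
the lockless walk; roughly (reproduced from memory, so only approximate -- the
shape is the same in arch/s390/mm/gup.c and in the generic fast GUP):

	pmd_t pmd = READ_ONCE(*pmdp);

	next = pmd_addr_end(addr, end);
	if (pmd_none(pmd))
		return 0;
	/* Anything else -- including an entry that is marked invalid but is
	 * not empty -- lets the lockless walker carry on, which is exactly
	 * the concern raised above. */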

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Kirill A. Shutemov
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger  wrote:
> 
> > On 02/12/2016 04:41 PM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> > >> On Thu, 11 Feb 2016 21:09:42 +0200
> > >> "Kirill A. Shutemov"  wrote:
> > >>
> > >>> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > >>>> Hi,
> > >>>>
> > >>>> Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 
> > >>>> and
> > >>>> he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > >>>> review of the THP rework patches, which cannot be bisected, revealed
> > >>>> commit fecffad "s390, thp: remove infrastructure for handling 
> > >>>> splitting PMDs"
> > >>>> (and also similar commits for other archs).
> > >>>>
> > >>>> This commit removes the THP splitting bit and also the architecture
> > >>>> implementation of pmdp_splitting_flush(), which took care of the IPI 
> > >>>> for
> > >>>> fast_gup serialization. The commit message says
> > >>>>
> > >>>> pmdp_splitting_flush() is not needed too: on splitting PMD we will 
> > >>>> do
> > >>>> pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI 
> > >>>> as
> > >>>> needed for fast_gup
> > >>>>
> > >>>> The assumption that a TLB flush will also produce an IPI is wrong on 
> > >>>> s390,
> > >>>> and maybe also on other architectures, and I thought that this was 
> > >>>> actually
> > >>>> the main reason for having an arch-specific pmdp_splitting_flush().
> > >>>>
> > >>>> At least PowerPC and ARM also had an individual implementation of
> > >>>> pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > >>>> flush to send the IPI, and those were also removed. Putting the arch
> > >>>> maintainers and mailing lists on cc to verify.
> > >>>>
> > >>>> On s390 this will break the IPI serialization against fast_gup, which
> > >>>> would certainly explain the random kernel crashes, please revert or fix
> > >>>> the pmdp_splitting_flush() removal.
> > >>>
> > >>> Sorry for that.
> > >>>
> > >>> I believe, the problem was already addressed for PowerPC:
> > >>>
> > >>> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> > >>>
> > >>> I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> > >>> the trick, right?
> > >>
> > >> Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> > >> fast_gup will still return false, because the pmd is not empty (at least
> > >> on s390). So I don't see spontaneously how it will help fast_gup to break
> > >> out to the slow path in case of THP splitting.
> > > 
> > > What pmdp_flush_direct() does in pmdp_invalidate()? It's hard to unwrap 
> > > for me :-/
> > > Does it make the pmd !pmd_present()?
> > 
> > It uses the idte instruction, which in an atomic fashion flushes the 
> > associated
> > TLB entry and changes the value of the pmd entry to invalid. This comes 
> > from the
> > HW requirement to not  change a PTE/PMD that might be still in use, other 
> > than 
> > with special instructions that does the tlb handling and the invalidation 
> > together.
> 
> Correct, and it does _not_ make the pmd !pmd_present(), that would only be the
> case after a _clear_flush(). It only marks the pmd as invalid and flushes,
> so that it cannot generate a new TLB entry before the following 
> pmd_populate(),
> but it keeps its other content. This is to fulfill the requirements outlined 
> in
> the comment in mm/huge_memory.c before the call to pmdp_invalidate(). And
> independent from that comment, we would need such an _invalidate() or
> _clear_flush() on s390 before the pmd_populate() because of the HW details
> that Christian described.
> 
> Reading the comment again, I do now notice that it also says "mark the current
> pmd notpresent

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Kirill A. Shutemov
On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> 
> On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > Could you check if revert of fecffad25458 helps?
> 
> I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> 
> [ 1851.721062] Unable to handle kernel pointer dereference in virtual kernel 
> address space
> [ 1851.721075] failing address:  TEID: 0483
> [ 1851.721078] Fault in home space mode while using kernel ASCE.
> [ 1851.721085] AS:00d5c007 R3:0007 S:a800 
> P:003d
> [ 1851.721128] Oops: 0004 ilc:3 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 1851.721135] Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en ib_sa 
> ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core ib_addr 
> ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 des_generic 
> genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common crc_itu_t dm_mod 
> scm_block vhost_net tun vhost eadm_sch macvtap macvlan kvm autofs4
> [ 1851.721183] CPU: 7 PID: 256422 Comm: bash Not tainted 
> 4.5.0-rc3-00058-g07923d7-dirty #178
> [ 1851.721186] task: 7fbfd290 ti: 8c604000 task.ti: 
> 8c604000
> [ 1851.721189] Krnl PSW : 0704d0018000 0045d3b8 
> (__rb_erase_color+0x280/0x308)
> [ 1851.721200]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
> EA:3
>Krnl GPRS: 0001 0020  
> bd07eff1
> [ 1851.721205]0027ca10  83e45898 
> 77b61198
> [ 1851.721207]7ce1a490 bd07eff0 7ce1a548 
> 0027ca10
> [ 1851.721210]bd07c350 bd07eff0 8c607aa8 
> 8c607a68
> [ 1851.721221] Krnl Code: 0045d3aa: e3c0d0080024   stg 
> %%r12,8(%%r13)
>   0045d3b0: b9040039   lgr 
> %%r3,%%r9
>  #0045d3b4: a53b0001   oill%%r3,1
>  >0045d3b8: e3301024   stg 
> %%r3,0(%%r1)
>   0045d3be: ec28000e007c   cgij
> %%r2,0,8,45d3da
>   0045d3c4: e3402004   lg  
> %%r4,0(%%r2)
>   0045d3ca: b904001c   lgr 
> %%r1,%%r12
>   0045d3ce: ec143f3f0056   rosbg   
> %%r1,%%r4,63,63,0
> [ 1851.721269] Call Trace:
> [ 1851.721273] ([<83e45898>] 0x83e45898)
> [ 1851.721279]  [<0029342a>] unlink_anon_vmas+0x9a/0x1d8
> [ 1851.721282]  [<00283f34>] free_pgtables+0xcc/0x148
> [ 1851.721285]  [<0028c376>] exit_mmap+0xd6/0x300
> [ 1851.721289]  [<00134db8>] mmput+0x90/0x118
> [ 1851.721294]  [<002d76bc>] flush_old_exec+0x5d4/0x700
> [ 1851.721298]  [<003369f4>] load_elf_binary+0x2f4/0x13e8
> [ 1851.721301]  [<002d6e4a>] search_binary_handler+0x9a/0x1f8
> [ 1851.721304]  [<002d8970>] do_execveat_common.isra.32+0x668/0x9a0
> [ 1851.721307]  [<002d8cec>] do_execve+0x44/0x58
> [ 1851.721310]  [<002d8f92>] SyS_execve+0x3a/0x48
> [ 1851.721315]  [<006fb096>] system_call+0xd6/0x258
> [ 1851.721317]  [<03ff997436d6>] 0x3ff997436d6
> [ 1851.721319] INFO: lockdep is turned off.
> [ 1851.721321] Last Breaking-Event-Address:
> [ 1851.721323]  [<0045d31a>] __rb_erase_color+0x1e2/0x308
> [ 1851.721327]
> [ 1851.721329] ---[ end trace 0d80041ac00cfae2 ]---
> 
> 
> > 
> > And could you share how crashes looks like? I haven't seen backtraces yet.
> > 
> 
> Sure. I didn't because they really looked random to me. Most of the time
> in rcu or list debugging but I thought these have just been the messenger
> observing a corruption first. Anyhow, here is an older one that might look
> interesting:
> 
> [   59.851421] list_del corruption. next->prev should be 6e1eb000, 
> but was 0400

This is kinda interesting: 0x400 is TAIL_MAPPING.. Hm..

Could you check if you see the problem on commit 1c290f642101 and its
immediate parent?
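
For context, 0x400 matches the tail-page poison value, roughly as defined in
include/linux/poison.h at the time (cited from memory, so approximate):

	/*
	 * Used to poison ->mapping of tail pages, so that stale users of a
	 * freed or split compound page trip over 0x400 instead of a valid
	 * pointer.
	 */
	#define TAIL_MAPPING	((void *) 0x400 + POISON_POINTER_DELTA)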

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Kirill A. Shutemov
On Mon, Feb 15, 2016 at 07:37:02PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 13:31:59 +0200
> "Kirill A. Shutemov"  wrote:
> 
> > On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> > > 
> > > On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > > > Could you check if revert of fecffad25458 helps?
> > > 
> > > I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> > > 
> > > [ 1851.721062] Unable to handle kernel pointer dereference in virtual 
> > > kernel address space
> > > [ 1851.721075] failing address:  TEID: 0483
> > > [ 1851.721078] Fault in home space mode while using kernel ASCE.
> > > [ 1851.721085] AS:00d5c007 R3:0007 S:a800 
> > > P:003d
> > > [ 1851.721128] Oops: 0004 ilc:3 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > > [ 1851.721135] Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en 
> > > ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core 
> > > ib_addr ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 
> > > des_generic genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common 
> > > crc_itu_t dm_mod scm_block vhost_net tun vhost eadm_sch macvtap macvlan 
> > > kvm autofs4
> > > [ 1851.721183] CPU: 7 PID: 256422 Comm: bash Not tainted 
> > > 4.5.0-rc3-00058-g07923d7-dirty #178
> > > [ 1851.721186] task: 7fbfd290 ti: 8c604000 task.ti: 
> > > 8c604000
> > > [ 1851.721189] Krnl PSW : 0704d0018000 0045d3b8 
> > > (__rb_erase_color+0x280/0x308)
> > > [ 1851.721200]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > PM:0 EA:3
> > >Krnl GPRS: 0001 0020 
> > >  bd07eff1
> > > [ 1851.721205]0027ca10  
> > > 83e45898 77b61198
> > > [ 1851.721207]7ce1a490 bd07eff0 
> > > 7ce1a548 0027ca10
> > > [ 1851.721210]bd07c350 bd07eff0 
> > > 8c607aa8 8c607a68
> > > [ 1851.721221] Krnl Code: 0045d3aa: e3c0d0080024   stg 
> > > %%r12,8(%%r13)
> > >   0045d3b0: b9040039   lgr 
> > > %%r3,%%r9
> > >  #0045d3b4: a53b0001   oill
> > > %%r3,1
> > >  >0045d3b8: e3301024   stg 
> > > %%r3,0(%%r1)
> > >   0045d3be: ec28000e007c   cgij
> > > %%r2,0,8,45d3da
> > >   0045d3c4: e3402004   lg  
> > > %%r4,0(%%r2)
> > >   0045d3ca: b904001c   lgr 
> > > %%r1,%%r12
> > >   0045d3ce: ec143f3f0056   rosbg   
> > > %%r1,%%r4,63,63,0
> > > [ 1851.721269] Call Trace:
> > > [ 1851.721273] ([<83e45898>] 0x83e45898)
> > > [ 1851.721279]  [<0029342a>] unlink_anon_vmas+0x9a/0x1d8
> > > [ 1851.721282]  [<00283f34>] free_pgtables+0xcc/0x148
> > > [ 1851.721285]  [<0028c376>] exit_mmap+0xd6/0x300
> > > [ 1851.721289]  [<00134db8>] mmput+0x90/0x118
> > > [ 1851.721294]  [<002d76bc>] flush_old_exec+0x5d4/0x700
> > > [ 1851.721298]  [<003369f4>] load_elf_binary+0x2f4/0x13e8
> > > [ 1851.721301]  [<002d6e4a>] search_binary_handler+0x9a/0x1f8
> > > [ 1851.721304]  [<002d8970>] 
> > > do_execveat_common.isra.32+0x668/0x9a0
> > > [ 1851.721307]  [<002d8cec>] do_execve+0x44/0x58
> > > [ 1851.721310]  [<002d8f92>] SyS_execve+0x3a/0x48
> > > [ 1851.721315]  [<006fb096>] system_call+0xd6/0x258
> > > [ 1851.721317]  [<03ff997436d6>] 0x3ff997436d6
> > > [ 1851.721319] INFO: lockdep is turned off.
> > > [ 1851.721321] Last Breaking-Event-Address:
> > > [ 1851.721323]  [<0045d31a>] __rb_erase_color+0x1e2/0x308
> > > [ 1851.721327]
> > > [ 1851.721329] ---[ end trace 0d80041ac00cfae2 ]---
> > > 
> > > 
> > > > 
> > > > And could you share how crashes looks like? I haven't seen backtraces 
> > > > yet.
> > > > 
> > > 
> > > Sure. I didn&

Re: [PATCH V2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-11 Thread Kirill A. Shutemov
On Mon, Nov 07, 2016 at 08:16:39PM +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64 want to use page table deposit/withraw
> even with huge pmd dax entries. Allow arch to override the
> vma_is_anonymous check by moving that to pmd_move_must_withdraw
> function
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm: THP page cache support for ppc64

2016-11-11 Thread Kirill A. Shutemov
On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> page *page)
>   ret = VM_FAULT_FALLBACK;
>   page = compound_head(page);
>  
> + /*
> +  * Archs like ppc64 need additonal space to store information
> +  * related to pte entry. Use the preallocated table for that.
> +  */
> + if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> + fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> +

-ENOMEM handling?

I think we should do this way earlier, before this point. Maybe in do_fault()
or something.

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm: THP page cache support for ppc64

2016-11-11 Thread Kirill A. Shutemov
On Fri, Nov 11, 2016 at 05:42:11PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov"  writes:
> 
> > On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> >> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> >> page *page)
> >>ret = VM_FAULT_FALLBACK;
> >>page = compound_head(page);
> >>  
> >> +  /*
> >> +   * Archs like ppc64 need additonal space to store information
> >> +   * related to pte entry. Use the preallocated table for that.
> >> +   */
> >> +  if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> >> +  fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> >> +
> >
> > -ENOMEM handling?
> 
> How about
> 
>   if (arch_needs_pgtable_deposit() && !fe->prealloc_pte) {
>   fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
>   if (!fe->prealloc_pte)
>   return VM_FAULT_OOM;
>   }
> 
> 
> 
> >
> > I think we should do this way before this point. Maybe in do_fault() or
> > something.
> 
> doing this in do_set_pmd keeps this closer to where we set the pmd. Any
> reason you thing we should move it higher up the stack. We already do
> pte_alloc() at the same level for a non transhuge case in
> alloc_set_pte().

I vaguely remember Hugh mentioning a deadlock between allocation under the
page lock and the OOM killer (or something else?).

If the deadlock is still there, fixing it would be a matter of making the
preallocation unconditional.

But what you propose above doesn't make the situation any worse. I'm fine with
that.

-- 
 Kirill A. Shutemov


Re: [PATCH V3 2/2] mm: THP page cache support for ppc64

2016-11-13 Thread Kirill A. Shutemov
On Sun, Nov 13, 2016 at 08:30:25PM +0530, Aneesh Kumar K.V wrote:
> Add arch specific callback in the generic THP page cache code that will
> deposit and withdarw preallocated page table. Archs like ppc64 use
> this preallocated table to store the hash pte slot information.
> 
> Testing:
> kernel build of the patch series on tmpfs mounted with option huge=always
> 
> The related thp stat:
> thp_fault_alloc 72939
> thp_fault_fallback 60547
> thp_collapse_alloc 603
> thp_collapse_alloc_failed 0
> thp_file_alloc 253763
> thp_file_mapped 4251
> thp_split_page 51518
> thp_split_page_failed 1
> thp_deferred_split_page 73566
> thp_split_pmd 665
> thp_zero_page_alloc 3
> thp_zero_page_alloc_failed 0
> 
> Signed-off-by: Aneesh Kumar K.V 

One nit-pick below, but otherwise

Acked-by: Kirill A. Shutemov 

> @@ -2975,6 +3004,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> page *page)
>   ret = 0;
>   count_vm_event(THP_FILE_MAPPED);
>  out:
> + /*
> +  * If we are going to fallback to pte mapping, do a
> +  * withdraw with pmd lock held.
> +  */
> + if (arch_needs_pgtable_deposit() && (ret == VM_FAULT_FALLBACK))

Parenthesis are redundant around ret check.

> + fe->prealloc_pte = pgtable_trans_huge_withdraw(vma->vm_mm,
> +            fe->pmd);
>   spin_unlock(fe->ptl);
>   return ret;
>  }

-- 
 Kirill A. Shutemov


Re: [RFC 1/4] mm: remove unused TASK_SIZE_OF()

2017-01-02 Thread Kirill A. Shutemov
On Fri, Dec 30, 2016 at 06:56:31PM +0300, Dmitry Safonov wrote:
> All users of TASK_SIZE_OF(tsk) have migrated to mm->task_size or
> TASK_SIZE_MAX since:
> commit d696ca016d57 ("x86/fsgsbase/64: Use TASK_SIZE_MAX for
> FSBASE/GSBASE upper limits"),
> commit a06db751c321 ("pagemap: check permissions and capabilities at
> open time"),
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ralf Baechle 
> Cc: "James E.J. Bottomley" 
> Cc: Helge Deller 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: "David S. Miller" 
> Cc: Peter Zijlstra 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Signed-off-by: Dmitry Safonov 

I've noticed this too.

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: stop leaking PageTables

2017-01-08 Thread Kirill A. Shutemov
On Sat, Jan 07, 2017 at 03:37:31PM -0800, Hugh Dickins wrote:
> 4.10-rc loadtest (even on x86, even without THPCache) fails with
> "fork: Cannot allocate memory" or some such; and /proc/meminfo
> shows PageTables growing.
> 
> rc1 removed the freeing of an unused preallocated pagetable after
> do_fault_around() has called map_pages(): which is usually a good
> optimization, so that the followup doesn't have to reallocate one;
> but it's not sufficient to shift the freeing into alloc_set_pte(),
> since there are failure cases (most commonly VM_FAULT_RETRY) which
> never reach finish_fault().
> 
> Check and free it at the outer level in do_fault(), then we don't
> need to worry in alloc_set_pte(), and can restore that to how it was
> (I cannot find any reason to pte_free() under lock as it was doing).
> 
> And fix a separate pagetable leak, or crash, introduced by the same
> change, that could only show up on some ppc64: why does do_set_pmd()'s
> failure case attempt to withdraw a pagetable when it never deposited
> one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
> Residue of an earlier implementation, perhaps?  Delete it.
> 
> Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
> Signed-off-by: Hugh Dickins 

Sorry that I missed this initially.

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [RFC PATCH] powerpc/mm: THP page cache support

2016-09-26 Thread Kirill A. Shutemov
On Thu, Sep 22, 2016 at 09:32:40PM +0530, Aneesh Kumar K.V wrote:
> Update arch hook in the generic THP page cache code, that will
> deposit and withdarw preallocated page table. Archs like ppc64 use
> this preallocated table to store the hash pte slot information.
> 
> This is an RFC patch and I am sharing this early to get feedback on the
> approach taken. I have used stress-ng mmap-file operation and that
> resulted in some thp_file_mmap as show below.
> 
> [/mnt/stress]$ grep thp_file /proc/vmstat
> thp_file_alloc 25403
> thp_file_mapped 16967
> [/mnt/stress]$
> 
> I did observe wrong nr_ptes count once. I need to recreate the problem
> again.

I don't see anything that could cause that.

The patch looks good to me (apart from the nr_ptes issue). A few minor
nitpicks below.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  3 ++
>  include/asm-generic/pgtable.h|  8 +++-
>  mm/Kconfig   |  6 +--
>  mm/huge_memory.c | 19 +-
>  mm/khugepaged.c  | 21 ++-
>  mm/memory.c  | 56 
> +++-
>  6 files changed, 93 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 263bf39ced40..1f45b06ce78e 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1017,6 +1017,9 @@ static inline int pmd_move_must_withdraw(struct 
> spinlock *new_pmd_ptl,
>*/
>   return true;
>  }
> +
> +#define arch_needs_pgtable_deposit() (true)
> +
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index d4458b6dbfb4..0d1e400e82a2 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -660,11 +660,17 @@ static inline int pmd_move_must_withdraw(spinlock_t 
> *new_pmd_ptl,
>   /*
>* With split pmd lock we also need to move preallocated
>* PTE page table if new_pmd is on different PMD page table.
> +  *
> +  * We also don't deposit and withdraw tables for file pages.
>*/
> - return new_pmd_ptl != old_pmd_ptl;
> + return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
>  }
>  #endif
>  
> +#ifndef arch_needs_pgtable_deposit
> +#define arch_needs_pgtable_deposit() (false)
> +#endif
> +
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> diff --git a/mm/Kconfig b/mm/Kconfig
> index be0ee11fa0d9..0a279d399722 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -447,13 +447,9 @@ choice
> benefit.
>  endchoice
>  
> -#
> -# We don't deposit page tables on file THP mapping,
> -# but Power makes use of them to address MMU quirk.
> -#
>  config   TRANSPARENT_HUGE_PAGECACHE
>   def_bool y
> - depends on TRANSPARENT_HUGEPAGE && !PPC
> + depends on TRANSPARENT_HUGEPAGE
>  
>  #
>  # UP and nommu archs use km based percpu allocator
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a6abd76baa72..37176f455d16 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1320,6 +1320,14 @@ out_unlocked:
>   return ret;
>  }
>  
> +void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)

static?

> +{
> + pgtable_t pgtable;
> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + pte_free(mm, pgtable);
> + atomic_long_dec(&mm->nr_ptes);
> +}
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>pmd_t *pmd, unsigned long addr)
>  {
> @@ -1359,6 +1367,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   atomic_long_dec(&tlb->mm->nr_ptes);
>   add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>   } else {
> + if (arch_needs_pgtable_deposit())

Just hide the arch_needs_pgtable_deposit() check in zap_deposited_table().

> + zap_deposited_table(tlb->mm, pmd);
>   add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
>   }
>   spin_unlock(ptl);
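
As a sketch of the shape being suggested here -- the helper made static, with
the arch_needs_pgtable_deposit() check carried inside it so the caller in
zap_huge_pmd() stays unconditional (illustrative only, not the final code):

static void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
	pgtable_t pgtable;

	/* Nothing was deposited for this mapping on such architectures. */
	if (!arch_needs_pgtable_deposit())
		return;

	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pte_free(mm, pgtable);
	atomic_long_dec(&mm->nr_ptes);
}
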
-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/2] treewide: remove unused address argument from pte_alloc functions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:55PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/arch/m68k/include/asm/mcf_pgalloc.h 
> b/arch/m68k/include/asm/mcf_pgalloc.h
> index 12fe700632f4..4399d712f6db 100644
> --- a/arch/m68k/include/asm/mcf_pgalloc.h
> +++ b/arch/m68k/include/asm/mcf_pgalloc.h
> @@ -12,8 +12,7 @@ extern inline void pte_free_kernel(struct mm_struct *mm, 
> pte_t *pte)
>  
>  extern const char bad_pmd_string[];
>  
> -extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> - unsigned long address)
> +extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>  {
>   unsigned long page = __get_free_page(GFP_DMA);
>  
> @@ -32,8 +31,6 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned 
> long address)
>  #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)  ({ BUG(); ((pmd_t *)2); })
>  
> -#define pte_alloc_one_fast(mm, addr) pte_alloc_one(mm, addr)
> -

I believe this one was done manually, right?
Please explicitly state everything you did on top of the semantic patch.

...

> diff --git a/arch/microblaze/include/asm/pgalloc.h 
> b/arch/microblaze/include/asm/pgalloc.h
> index 7c89390c0c13..f4cc9ffc449e 100644
> --- a/arch/microblaze/include/asm/pgalloc.h
> +++ b/arch/microblaze/include/asm/pgalloc.h
> @@ -108,10 +108,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
>  #define pmd_alloc_one_fast(mm, address)  ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)   ({ BUG(); ((pmd_t *)2); })
>  
> -extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
> +extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
>  
> -static inline struct page *pte_alloc_one(struct mm_struct *mm,
> - unsigned long address)
> +static inline struct page *pte_alloc_one(struct mm_struct *mm)
>  {
>   struct page *ptepage;
>  
> @@ -132,20 +131,6 @@ static inline struct page *pte_alloc_one(struct 
> mm_struct *mm,
>   return ptepage;
>  }
>  
> -static inline pte_t *pte_alloc_one_fast(struct mm_struct *mm,
> - unsigned long address)
> -{
> - unsigned long *ret;
> -
> - ret = pte_quicklist;
> - if (ret != NULL) {
> - pte_quicklist = (unsigned long *)(*ret);
> - ret[0] = 0;
> - pgtable_cache_size--;
> - }
> - return (pte_t *)ret;
> -}
> -

Ditto.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
> 
> The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> completion times drops from 160-250 millesconds to 380-400 microseconds.
> 
> Before:
> Total mremap time for 1GB data: 242321014 nanoseconds.
> Total mremap time for 1GB data: 196842467 nanoseconds.
> Total mremap time for 1GB data: 167051162 nanoseconds.
> 
> After:
> Total mremap time for 1GB data: 385781 nanoseconds.
> Total mremap time for 1GB data: 388959 nanoseconds.
> Total mremap time for 1GB data: 402813 nanoseconds.
> 
> Incase THP is enabled, the optimization is skipped. I also flush the
> tlb every time we do this optimization since I couldn't find a way to
> determine if the low-level PTEs are dirty. It is seen that the cost of
> doing so is not much compared the improvement, on both x86-64 and arm64.

I looked into the code more and noticed the move_pte() helper called from
move_ptes(). It changes the PTE entry to suit the new address.

It is only defined in a non-trivial way on Sparc. I don't know much about
Sparc, and it's hard for me to say whether the optimization will break
anything there.

I think it is worth disabling the optimization if __HAVE_ARCH_MOVE_PTE is
defined, or making architectures state explicitly that the optimization is
safe.

> @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> *vma,
>   split_huge_pmd(vma, old_pmd, old_addr);
>   if (pmd_trans_unstable(old_pmd))
>   continue;
> + } else if (extent == PMD_SIZE) {

Hm. What guarantees that new_addr is PMD_SIZE-aligned?
It's not obvious to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> > *vma,
> > split_huge_pmd(vma, old_pmd, old_addr);
> > if (pmd_trans_unstable(old_pmd))
> > continue;
> > +   } else if (extent == PMD_SIZE) {
> 
> Hm. What guarantees that new_addr is PMD_SIZE-aligned?
> It's not obvious to me.

Ignore this :)

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > Android needs to mremap large regions of memory during memory management
> > > related operations. The mremap system call can be really slow if THP is
> > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > pte at a time, and can be really slow across a large map. Turning on THP
> > > may not be a viable option, and is not for us. This patch speeds up the
> > > performance for non-THP system by copying at the PMD level when possible.
> > > 
> > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > 
> > > Before:
> > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > 
> > > After:
> > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > 
> > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > tlb every time we do this optimization since I couldn't find a way to
> > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > I looked into the code more and noticed move_pte() helper called from
> > move_ptes(). It changes PTE entry to suite new address.
> > 
> > It is only defined in non-trivial way on Sparc. I don't know much about
> > Sparc and it's hard for me to say if the optimization will break anything
> > there.
> 
> Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It is
> not modifying the PTE itself AFAICS:
> 
> #ifdef DCACHE_ALIASING_POSSIBLE
> #define __HAVE_ARCH_MOVE_PTE
> #define move_pte(pte, prot, old_addr, new_addr) \
> ({  \
> pte_t newpte = (pte);   \
> if (tlb_type != hypervisor && pte_present(pte)) {   \
> unsigned long this_pfn = pte_pfn(pte);  \
> \
> if (pfn_valid(this_pfn) &&  \
> (((old_addr) ^ (new_addr)) & (1 << 13)))\
> flush_dcache_page_all(current->mm,  \
>   pfn_to_page(this_pfn));   \
> }   \
> newpte; \
> })
> #endif
> 
> If its an issue, then how do transparent huge pages work on Sparc?  I don't
> see the huge page code (move_huge_pages) during mremap doing anything special
> for Sparc architecture when moving PMDs..

My *guess* is that it will work fine on Sparc, as apparently it only cares
about a change in bit 13 of the virtual address. That will never happen for
huge pages or when PTE page tables move.

But I just realized that the problem is bigger: since we pass new_addr to
set_pte_at(), we would need to audit all implementations to make sure they are
safe with just moving the PTE page table.

I would rather go with per-architecture enabling. It's much safer.

> Also, do we not flush the caches from any path when we munmap address space?
> We do call do_munmap on the old mapping from mremap after moving to the new 
> one.

Are you sure about that? It can be hidden deeper in architecture-specific
code.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations. The mremap system call can be really slow if THP is
> > not enabled. The bottleneck is move_page_tables, which is copying each
> > pte at a time, and can be really slow across a large map. Turning on THP
> > may not be a viable option, and is not for us. This patch speeds up the
> > performance for non-THP system by copying at the PMD level when possible.
> > 
> > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > 
> > Before:
> > Total mremap time for 1GB data: 242321014 nanoseconds.
> > Total mremap time for 1GB data: 196842467 nanoseconds.
> > Total mremap time for 1GB data: 167051162 nanoseconds.
> > 
> > After:
> > Total mremap time for 1GB data: 385781 nanoseconds.
> > Total mremap time for 1GB data: 388959 nanoseconds.
> > Total mremap time for 1GB data: 402813 nanoseconds.
> > 
> > Incase THP is enabled, the optimization is skipped. I also flush the
> > tlb every time we do this optimization since I couldn't find a way to
> > determine if the low-level PTEs are dirty. It is seen that the cost of
> > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > Cc: minc...@kernel.org
> > Cc: pan...@google.com
> > Cc: hu...@google.com
> > Cc: lokeshgi...@google.com
> > Cc: dan...@google.com
> > Cc: mho...@kernel.org
> > Cc: kir...@shutemov.name
> > Cc: a...@linux-foundation.org
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >   mm/mremap.c | 62 +
> >   1 file changed, 62 insertions(+)
> > 
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..d82c485822ef 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > pmd_t *old_pmd,
> > drop_rmap_locks(vma);
> >   }
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > old_addr,
> > + unsigned long new_addr, unsigned long old_end,
> > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +   spinlock_t *old_ptl, *new_ptl;
> > +   struct mm_struct *mm = vma->vm_mm;
> > +
> > +   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +   || old_end - old_addr < PMD_SIZE)
> > +   return false;
> > +
> > +   /*
> > +* The destination pmd shouldn't be established, free_pgtables()
> > +* should have release it.
> > +*/
> > +   if (WARN_ON(!pmd_none(*new_pmd)))
> > +   return false;
> > +
> > +   /*
> > +* We don't have to worry about the ordering of src and dst
> > +* ptlocks because exclusive mmap_sem prevents deadlock.
> > +*/
> > +   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +   if (old_ptl) {
> > +   pmd_t pmd;
> > +
> > +   new_ptl = pmd_lockptr(mm, new_pmd);
> > +   if (new_ptl != old_ptl)
> > +   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +   /* Clear the pmd */
> > +   pmd = *old_pmd;
> > +   pmd_clear(old_pmd);
> > +
> > +   VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +       /* Set the new pmd */
> > +   set_pmd_at(mm, new_addr, new_pmd, pmd);
> 
> UML does not have set_pmd_at at all

Every architecture does. :)

But it may not come from the arch code.

> If I read the code right, MIPS completely ignores the address argument so
> set_pmd_at there may not have the effect which this patch is trying to
> achieve.

Ignoring the address is fine. Most architectures do that.
The idea is to move the page table to the new pmd slot. It has nothing to do
with the address passed to set_pmd_at().

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 09:57:19AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 04:19:46PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> > > On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during memory 
> > > > > management
> > > > > related operations. The mremap system call can be really slow if THP 
> > > > > is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map. Turning on 
> > > > > THP
> > > > > may not be a viable option, and is not for us. This patch speeds up 
> > > > > the
> > > > > performance for non-THP system by copying at the PMD level when 
> > > > > possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400 
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both x86-64 and 
> > > > > arm64.
> > > > 
> > > > I looked into the code more and noticed move_pte() helper called from
> > > > move_ptes(). It changes PTE entry to suite new address.
> > > > 
> > > > It is only defined in non-trivial way on Sparc. I don't know much about
> > > > Sparc and it's hard for me to say if the optimization will break 
> > > > anything
> > > > there.
> > > 
> > > Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It 
> > > is
> > > not modifying the PTE itself AFAICS:
> > > 
> > > #ifdef DCACHE_ALIASING_POSSIBLE
> > > #define __HAVE_ARCH_MOVE_PTE
> > > #define move_pte(pte, prot, old_addr, new_addr) \
> > > ({  \
> > > pte_t newpte = (pte);   \
> > > if (tlb_type != hypervisor && pte_present(pte)) {   \
> > > unsigned long this_pfn = pte_pfn(pte);  \
> > > \
> > > if (pfn_valid(this_pfn) &&  \
> > > (((old_addr) ^ (new_addr)) & (1 << 13)))\
> > > flush_dcache_page_all(current->mm,  \
> > >   pfn_to_page(this_pfn));   \
> > > }   \
> > > newpte; \
> > > })
> > > #endif
> > > 
> > > If its an issue, then how do transparent huge pages work on Sparc?  I 
> > > don't
> > > see the huge page code (move_huge_pages) during mremap doing anything 
> > > special
> > > for Sparc architecture when moving PMDs..
> > 
> > My *guess* is that it will work fine on Sparc as it apprarently it only
> > cares about change in bit 13 of virtual address. It will never happen for
> > huge pages or when PTE page tables move.
> > 
> > But I just realized that the problem is bigger: since we pass new_addr to
> > the set_pte_at() we would need to audit all implementations that they are
> > safe with just moving PTE page table.
> > 
> > I would rather go with per-architecture enabling. It's much safer.
> 
> I'm Ok with the per-arch enabling, I agree its safer. So I should be adding a
> a new __HAVE_ARCH_MOVE_PMD right, or did you have a better name for that?

I believe a Kconfig option is the more canonical way to do this nowadays.
So CONFIG_HAVE_ARCH_MOVE_PMD, I guess. Or CONFIG_HAVE_MOVE_PMD.
An arch that supports it would select the option.
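
As a rough sketch of how that opt-in could look on the C side (the
CONFIG_HAVE_MOVE_PMD name is the one suggested above; the hunk is only
illustrative, not the final upstream form):

	/* In move_page_tables(): only architectures that selected the
	 * option take the whole-PMD fast path, everyone else keeps
	 * going through move_ptes(). */
	} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) && extent == PMD_SIZE) {
		if (move_normal_pmd(vma, old_addr, new_addr, old_end,
				    old_pmd, new_pmd, &need_flush))
			continue;
	}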

> Also, do you feel we should still need to remove the address argument from
> set_pte_alloc? Or should we leave that alone if we do per-arch?
> I figure I spent a bunch of time on that already anyway, and its a clean up
> anyway, so may as well do it. But perhaps that "pte_alloc cleanup" can then
> be a separate patch independent of this series?

Yeah. The cleanup makes sense anyway.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:42:24PM +0100, Anton Ivanov wrote:
> 
> On 10/12/18 3:48 PM, Anton Ivanov wrote:
> > On 12/10/2018 15:37, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> > > > On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during
> > > > > memory management
> > > > > related operations. The mremap system call can be really
> > > > > slow if THP is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map.
> > > > > Turning on THP
> > > > > may not be a viable option, and is not for us. This patch
> > > > > speeds up the
> > > > > performance for non-THP system by copying at the PMD level
> > > > > when possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both
> > > > > x86-64 and arm64.
> > > > > 
> > > > > Cc: minc...@kernel.org
> > > > > Cc: pan...@google.com
> > > > > Cc: hu...@google.com
> > > > > Cc: lokeshgi...@google.com
> > > > > Cc: dan...@google.com
> > > > > Cc: mho...@kernel.org
> > > > > Cc: kir...@shutemov.name
> > > > > Cc: a...@linux-foundation.org
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > ---
> > > > >    mm/mremap.c | 62
> > > > > +
> > > > >    1 file changed, 62 insertions(+)
> > > > > 
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..d82c485822ef 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct
> > > > > vm_area_struct *vma, pmd_t *old_pmd,
> > > > >    drop_rmap_locks(vma);
> > > > >    }
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma,
> > > > > unsigned long old_addr,
> > > > > +  unsigned long new_addr, unsigned long old_end,
> > > > > +  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > +    spinlock_t *old_ptl, *new_ptl;
> > > > > +    struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > +    if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > +    || old_end - old_addr < PMD_SIZE)
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * The destination pmd shouldn't be established, free_pgtables()
> > > > > + * should have release it.
> > > > > + */
> > > > > +    if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * We don't have to worry about the ordering of src and dst
> > > > > + * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > + */
> > > > > +    old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > +    if (old_ptl) {
> > > > > +    pmd_t pmd;
> > > > > +
&

Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
> *old_pmd,
>   drop_rmap_locks(vma);
>  }
>  
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> old_addr,
> +   unsigned long new_addr, unsigned long old_end,
> +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> + || old_end - old_addr < PMD_SIZE)
> + return false;
> +
> + /*
> +  * The destination pmd shouldn't be established, free_pgtables()
> +  * should have release it.
> +  */
> + if (WARN_ON(!pmd_none(*new_pmd)))
> + return false;
> +
> + /*
> +  * We don't have to worry about the ordering of src and dst
> +  * ptlocks because exclusive mmap_sem prevents deadlock.
> +  */
> + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> + if (old_ptl) {

How can it ever be false?

> + pmd_t pmd;
> +
> + new_ptl = pmd_lockptr(mm, new_pmd);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pmd */
> + pmd = *old_pmd;
> + pmd_clear(old_pmd);
> +
> + VM_BUG_ON(!pmd_none(*new_pmd));
> +
> + /* Set the new pmd */
> + set_pmd_at(mm, new_addr, new_pmd, pmd);
> + if (new_ptl != old_ptl)
> +     spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + *need_flush = true;
> + return true;
> + }
> + return false;
> +}
> +
-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > > pmd_t *old_pmd,
> > >   drop_rmap_locks(vma);
> > >  }
> > >  
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > > old_addr,
> > > +   unsigned long new_addr, unsigned long old_end,
> > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > +
> > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > + || old_end - old_addr < PMD_SIZE)
> > > + return false;
> > > +
> > > + /*
> > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > +  * should have release it.
> > > +  */
> > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > + return false;
> > > +
> > > + /*
> > > +  * We don't have to worry about the ordering of src and dst
> > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +  */
> > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > + if (old_ptl) {
> > 
> > How can it ever be false?
> > 
> > > + pmd_t pmd;
> > > +
> > > +     new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not
copy the code blindly.
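
For illustration, the whole helper with the impossible failure branch dropped
(a sketch of the simplification based on the quoted code, not the final
upstream version):

static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
		unsigned long new_addr, unsigned long old_end,
		pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
{
	spinlock_t *old_ptl, *new_ptl;
	struct mm_struct *mm = vma->vm_mm;
	pmd_t pmd;

	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
	    || old_end - old_addr < PMD_SIZE)
		return false;

	/*
	 * The destination pmd shouldn't be established; free_pgtables()
	 * should have released it.
	 */
	if (WARN_ON(!pmd_none(*new_pmd)))
		return false;

	/*
	 * We don't have to worry about the ordering of src and dst
	 * ptlocks because exclusive mmap_sem prevents deadlock.
	 */
	old_ptl = pmd_lock(vma->vm_mm, old_pmd);	/* cannot fail */
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	/* Clear the old entry and install it at the new slot. */
	pmd = *old_pmd;
	pmd_clear(old_pmd);
	VM_BUG_ON(!pmd_none(*new_pmd));
	set_pmd_at(mm, new_addr, new_pmd, pmd);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);

	*need_flush = true;
	return true;
}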

-- 
 Kirill A. Shutemov


Re: [PATCH 1/4] treewide: remove unused address argument from pte_alloc functions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:37:16AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 12, 2018 at 06:31:57PM -0700, Joel Fernandes (Google) wrote:
> > This series speeds up mremap(2) syscall by copying page tables at the
> > PMD level even for non-THP systems. There is concern that the extra
> > 'address' argument that mremap passes to pte_alloc may do something
> > subtle architecture related in the future that may make the scheme not
> > work.  Also we find that there is no point in passing the 'address' to
> > pte_alloc since its unused. So this patch therefore removes this
> > argument tree-wide resulting in a nice negative diff as well. Also
> > ensuring along the way that the enabled architectures do not do anything
> > funky with 'address' argument that goes unnoticed by the optimization.
> 
> Did you happen to look at the history of where that address argument
> came from? -- just being curious here. ISTR something vague about
> architectures having different paging structure for different memory
> ranges.

I see that some architectures (i.e. sparc and, I believe, power) used the
address for coloring. It's not needed anymore. The page allocator and SL?B are
good enough now.

See 3c936465249f ("[SPARC64]: Kill pgtable quicklists and use SLAB.")

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct 
> > > > > *vma, pmd_t *old_pmd,
> > > > >   drop_rmap_locks(vma);
> > > > >  }
> > > > >  
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned 
> > > > > long old_addr,
> > > > > +   unsigned long new_addr, unsigned long old_end,
> > > > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > + spinlock_t *old_ptl, *new_ptl;
> > > > > + struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > + || old_end - old_addr < PMD_SIZE)
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > > > +  * should have release it.
> > > > > +  */
> > > > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * We don't have to worry about the ordering of src and dst
> > > > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > +  */
> > > > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > + if (old_ptl) {
> > > > 
> > > > How can it ever be false?
> 
> Kirill,
> It cannot, you are right. I'll remove the test.
> 
> By the way, there are new changes upstream by Linus which flush the TLB
> before releasing the ptlock instead of after. I'm guessing that patch came
> about because of reviews of this patch and someone spotted an issue in the
> existing code :)
> 
> Anyway the patch in concern is:
> eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> 
> I need to rebase on top of that with appropriate modifications, but I worry
> that this patch will slow down performance since we have to flush at every
> PMD/PTE move before releasing the ptlock. Where as with my patch, the
> intention is to flush only at once in the end of move_page_tables. When I
> tried to flush TLB on every PMD move, it was quite slow on my arm64 device 
> [2].
> 
> Further observation [1] is, it seems like the move_huge_pmds and move_ptes 
> code
> is a bit sub optimal in the sense, we are acquiring and releasing the same
> ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
> right? Instead we can do better by acquiring and release the ptlock less
> often.
> 
> I think this observation [1] and the frequent TLB flush issue [2] can be 
> solved
> by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
> the tlb and then release the ptlock, and then proceed to doing the same thing
> for the PMDs in the next page-table page. What do you think?

Yeah, that's a viable optimization.

The tricky part is that one PMD page table can have PMD entries of different
types: THP, a page table that you can move as a whole, and one that you cannot
(for whatever reason).

If we cannot move the PMD entry as a whole and must go down to the PTE page
table, we would need to drop the PMD ptl and take the PTE ptl (it might be the
same lock in some configurations).

Also, we don't want to take the PMD lock unless it's required.

I expect it will not be trivial to get everything right. But take a shot :)

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Kirill A. Shutemov
On Mon, Dec 10, 2018 at 12:21:05PM -0200, Rafael David Tinoco wrote:
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 84bd9bdc1987..d808cfde3d19 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -64,8 +64,6 @@ extern unsigned int ptrs_per_p4d;
>  #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK (~(P4D_SIZE - 1))
>  
> -#define MAX_POSSIBLE_PHYSMEM_BITS52
> -
>  #else /* CONFIG_X86_5LEVEL */
>  
>  /*
> @@ -154,4 +152,6 @@ extern unsigned int ptrs_per_p4d;
>  
>  #define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))
>  
> +#define MAX_POSSIBLE_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> +

...

>  #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0787d33b80d8..132c20b6fd4f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c

...

> @@ -116,6 +100,25 @@
>   */
>  #define OBJ_ALLOCATED_TAG 1
>  #define OBJ_TAG_BITS 1
> +
> +/*
> + * MAX_POSSIBLE_PHYSMEM_BITS should be defined by all archs using zsmalloc:
> + * Trying to guess it from MAX_PHYSMEM_BITS, or considering it BITS_PER_LONG,
> + * proved to be wrong by not considering PAE capabilities, or using SPARSEMEM
> + * only headers, leading to bad object encoding due to object index overflow.
> + */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> + #define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
> + #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using zsmalloc";
> +#else
> + #ifndef CONFIG_64BIT
> +  #if (MAX_POSSIBLE_PHYSMEM_BITS >= (BITS_PER_LONG + PAGE_SHIFT - 
> OBJ_TAG_BITS))
> +   #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
> +  #endif
> + #endif
> +#endif
> +
> +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>  #define OBJ_INDEX_BITS   (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
>  #define OBJ_INDEX_MASK   ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)

Have you tested it with CONFIG_X86_5LEVEL=y?

AFAICS, the patch makes OBJ_INDEX_BITS, and everything that depends on it,
dynamic -- it now depends on which paging mode we are booting in.
ZS_SIZE_CLASSES depends indirectly on OBJ_INDEX_BITS, and I don't see how the
struct zs_pool definition can compile with a dynamic ZS_SIZE_CLASSES.

Hm?
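
For reference, the dependency chain in question looks roughly like this in
mm/zsmalloc.c of that era (reproduced from memory, so treat the exact
expressions as approximate):

	#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
	#define ZS_MIN_ALLOC_SIZE \
		MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
	#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
					      ZS_SIZE_CLASS_DELTA) + 1)

	struct zs_pool {
		...
		/* needs ZS_SIZE_CLASSES to be a compile-time constant */
		struct size_class *size_class[ZS_SIZE_CLASSES];
		...
	};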

-- 
 Kirill A. Shutemov


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Kirill A. Shutemov
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  
> wrote:
> >
> > I have more information regarding this BUG. Line 700 of page-flags.h is the
> > macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually 
> > expanded
> > the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), 
> > page)
> > in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
> > include/linux/mm.h. I also added a printk call to PageTable() that logs
> > page->page_type. The routine was called twice. The first had page_type of
> > 0xfbff, which would have been expected for a page table. The second call had
> > 0x, which led to the BUG.
> 
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
> 
> Then it took an exception (that "interrupt: 700") and that causes
> do_exit() again, and it tries to free the same page table - and now
> it's no longer marked as a page table, because it already went through
> the __ClearPageTable() dance once.
> 
> So on the second path through, it catches that "the bit already said
> it wasn't a page table" and does the BUG.
> 
> But the real question is what the problem was the *first* time around.

+Aneesh.

Looks like pgtable_page_dtor() gets called twice in the __pte_free_tlb() path:
once in __pte_free_tlb() itself and a second time in pgtable_free().

Would this help?

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e7a2f0e6b695 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
 {
-   pgtable_page_dtor(table);
pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..30a13b80fd58 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, 
pgtable_t table,
  unsigned long address)
 {
tlb_flush_pgtable(tlb, address);
-   pgtable_page_dtor(table);
pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
-- 
 Kirill A. Shutemov


Re: 5-level pagetable patches break ppc64le

2017-03-13 Thread Kirill A. Shutemov
On Mon, Mar 13, 2017 at 09:05:50PM +1100, Anton Blanchard wrote:
> Hi,
> 
> My ppc64le boot tests stopped working as of commit c2febafc6773 ("mm:
> convert generic code to 5-level paging")
> 
> We hang part way during boot, just before bringing up the network. I
> haven't had a chance to narrow it down yet.

Please check if patch by this link helps:

http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com

-- 
 Kirill A. Shutemov


Re: [PATCH v6 00/24] Speculative page faults

2018-01-16 Thread Kirill A. Shutemov
On Fri, Jan 12, 2018 at 06:25:44PM +0100, Laurent Dufour wrote:
> --
> Benchmarks results
> 
> Base kernel is 4.15-rc6-mmotm-2018-01-04-16-19
> SPF is BASE + this series

Do you have THP=always here? Lack of THP support worries me.

What is the performance in the worst-case scenario? Say, when we go far enough
into the speculative code path on every page fault and then fall back to the
normal page fault?

-- 
 Kirill A. Shutemov


Re: [mainline][Memory off/on][83e3c48] kernel Oops with memory hot-unplug on ppc

2018-02-19 Thread Kirill A. Shutemov
On Mon, Feb 19, 2018 at 02:47:35PM +0100, Michal Hocko wrote:
> [CC Kirill - I have a vague recollection that there were some follow ups
>  for 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
>  CONFIG_SPARSEMEM_EXTREME=y"). Does any of them apply to this issue?]

All fixups are in v4.15.

> On Mon 05-02-18 19:54:24, Abdul Haleem wrote:
> > 
> > Greetings,
> > 
> > Kernel Oops seen when memory hot-unplug on powerpc mainline kernel.
> > 
> > Machine: Power6 PowerVM ppc64
> > Kernel: 4.15.0
> > Config: attached
> > gcc: 4.8.2
> > Test: Memory hot-unplug of a memory block
> > echo offline > /sys/devices/system/memory/memory/state
> > 
> > The faulty instruction address points to the code path:
> > 
> > # gdb -batch vmlinux -ex 'list *(0xc0238330)'
> > 0xc0238330 is in get_pfnblock_flags_mask
> > (./include/linux/mmzone.h:1157).
> > 1152#endif
> > 1153
> > 1154static inline struct mem_section *__nr_to_section(unsigned long 
> > nr)
> > 1155{
> > 1156#ifdef CONFIG_SPARSEMEM_EXTREME
> > 1157if (!mem_section)
> > 1158return NULL;
> > 1159#endif
> > 1160if (!mem_section[SECTION_NR_TO_ROOT(nr)])
> > 1161return NULL;
> > 
> > 
> > The code was first introduced with commit( 83e3c48: mm/sparsemem:
> > Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y)

Any chance you could bisect it?

Could you check if the commit just before 83e3c48729d9 is fine?

-- 
 Kirill A. Shutemov


Re: [PATCH] selftests/vm: Update max va test to check for high address return.

2018-02-28 Thread Kirill A. Shutemov
On Wed, Feb 28, 2018 at 09:28:30AM +0530, Aneesh Kumar K.V wrote:
> mmap(-1,..) is expected to search from max supported VA top down. It should 
> find
> an address above ADDR_SWITCH_HINT. Explicitly check for this.

Hm. I don't think this is correct. -1 means the application supports any
address, not just the 47-bit address space. It doesn't mean the
application *requires* the address to be above 47-bit.

At least on x86, -1 just shifts the upper boundary of the address range where
we can look for an unmapped area.
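
To illustrate, a stand-alone sketch of the check I would consider correct
(not the actual selftest; ADDR_SWITCH_HINT is redefined locally here):

#include <stdio.h>
#include <sys/mman.h>

#define ADDR_SWITCH_HINT (1UL << 47)	/* local stand-in for the test's macro */
#define MAP_CHUNK_SIZE	 (1UL << 21)

int main(void)
{
	void *p = mmap((void *)-1, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/*
	 * Both outcomes are valid: the -1 hint only allows addresses above
	 * the 47-bit boundary, it does not demand them.
	 */
	printf("got %p (%s 47-bit boundary)\n", p,
	       (unsigned long)p >= ADDR_SWITCH_HINT ? "above" : "below");
	munmap(p, MAP_CHUNK_SIZE);
	return 0;
}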

-- 
 Kirill A. Shutemov


Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-26 Thread Kirill A. Shutemov
steful to me.

Have you considered looking at how we can hand over from the speculative to
the non-speculative path without starting from scratch (when possible)?

> + /* Transparent huge pages are not supported. */
> + if (unlikely(pmd_trans_huge(*pmd)))
> + goto out_walk;

That looks like a blocker to me.

Is there any problem with making it supported (besides plain coding)?

> +
> + vmf.vma = vma;
> + vmf.pmd = pmd;
> + vmf.pgoff = linear_page_index(vma, address);
> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
> + vmf.sequence = seq;
> + vmf.flags = flags;
> +
> + local_irq_enable();
> +
> + /*
> +  * We need to re-validate the VMA after checking the bounds, otherwise
> +  * we might have a false positive on the bounds.
> +  */
> + if (read_seqcount_retry(&vma->vm_sequence, seq))
> + goto unlock;
> +
> + ret = handle_pte_fault(&vmf);
> +
> +unlock:
> + srcu_read_unlock(&vma_srcu, idx);
> + return ret;
> +
> +out_walk:
> + local_irq_enable();
> + goto unlock;
> +}
> +#endif /* __HAVE_ARCH_CALL_SPF */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> -- 
> 2.7.4
> 

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().

2017-10-03 Thread Kirill A. Shutemov
On Tue, Oct 03, 2017 at 10:26:06AM -0400, Zi Yan wrote:
> From: Zi Yan 
> 
> A non present pmd entry can appear after pmd_lock is taken in
> page_vma_mapped_walk(), even if THP migration is not enabled.
> The WARN_ONCE is unnecessary.
> 
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Reported-and-tested-by: Abdul Haleem 
> Signed-off-by: Zi Yan 
> Cc: "Kirill A. Shutemov" 
> Cc: Anshuman Khandual 
> Cc: Andrew Morton 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
> 
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
> 
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
> 
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
> 
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit the stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide a hint (mallopt()?) to the malloc
implementation that it's ready for the full address space. In this case, malloc
can use mmap((void *)-1, ...) for its allocations and get the full address
space that way.
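
On the allocator side it could look roughly like this (M_FULL_VA is a
made-up mallopt-style knob, nothing like it exists today):

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

static bool full_va_opt_in;	/* would be set by a hypothetical mallopt(M_FULL_VA, 1) */

void *arena_grow(size_t len)
{
	/*
	 * Opted-in processes hint above the default window; everyone else
	 * keeps getting sub-47-bit addresses as before.
	 */
	void *hint = full_va_opt_in ? (void *)-1 : NULL;

	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}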

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was that we shouldn't let an application slip above 47-bit accidentally.

A correctly functioning program would never request addr+len above 47-bit
with MAP_FIXED unless it's ready to handle such addresses. Otherwise the
request would simply fail on a machine that doesn't support large VA.

In contrast, addr+len above 47-bit without MAP_FIXED will not fail on a
machine that doesn't support large VA; the kernel will find another place
under 47-bit. And I can imagine a reasonable application doing
something like this.

So we cannot assume the application is ready to handle large
addresses just because we see addr+len above 47-bit without MAP_FIXED.
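
In code terms, the distinction is (sketch only; the exact address and errno
are beside the point):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *hint = (void *)(1UL << 48);	/* above the 47-bit window */
	size_t len = 1UL << 21;

	/*
	 * With MAP_FIXED this fails outright on a machine that does not
	 * support large VA, so a program passing such an address is assumed
	 * to know what it is doing.
	 */
	void *fixed = mmap(hint, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	/*
	 * Without MAP_FIXED the hint is only advisory: on a machine without
	 * large VA the kernel just places the mapping below 47-bit, so
	 * success tells us nothing about the application's readiness.
	 */
	void *hinted = mmap(hint, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("MAP_FIXED: %p, hint only: %p\n", fixed, hinted);
	return 0;
}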

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right. This is broken. If a user requests a mapping above the vdso,
but below DEFAULT_MAP_WINDOW, it will succeed.

I'll send a patch to fix this. But it doesn't look like a show-stopper to me.

Re-checking things for this reply, I found an actual bug, see:

http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shute...@linux.intel.com

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've discussed all the options you are proposing before.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to the full address space later.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 09:15:21AM +0100, Florian Weimer wrote:
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect
> applications to do that.  If you want address-based opt in, it should work
> without MAP_FIXED.  Sure, in obscure cases, applications might still see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.

Just use mmap(-1, ...) without MAP_FIXED to get the full address space.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> 
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > good example that steadily keeps incrementing address, so depending
> > > on malloc usage and address space randomization, you will get a brk()
> > > that ends exactly at 128T, then the next one will be >
> > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > 
> > No, it won't. You will hit stack first.
> 
> That's not actually true on POWER in some cases.  See the process maps I
> posted here:
> 
>   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Hm? I see that in all three cases [stack] is the last mapping.
Am I missing something?

-- 
 Kirill A. Shutemov


  1   2   >