Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-25 Thread Kirill A. Shutemov
On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
 Without repeatable hard numbers such code just gets into the 
 kernel and bitrots there as new CPU generations come in - a few 
 years down the line the original decisions often degrade to pure 
 noise. We've been there, we've done that, we don't want to 
 repeat it.

Sorry for the late answer.

Hard numbers are hard.
I've checked some workloads: Mosbench, NPB, specjvm2008. Most of the time the
patchset doesn't show any difference (within run-to-run deviation).
On NPB it recovers the THP regression, but that's probably not enough to make
a decision.

It would be nice if somebody tested the patchset on another system or
workload, especially a configuration that shows a regression with
THP enabled.

-- 
 Kirill A. Shutemov



Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-25 Thread Andrea Arcangeli
Hi Kirill,

On Tue, Sep 25, 2012 at 05:27:03PM +0300, Kirill A. Shutemov wrote:
 On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
  Without repeatable hard numbers such code just gets into the 
  kernel and bitrots there as new CPU generations come in - a few 
  years down the line the original decisions often degrade to pure 
  noise. We've been there, we've done that, we don't want to 
  repeat it.
 
 Sorry for the late answer.
 
 Hard numbers are hard.
 I've checked some workloads: Mosbench, NPB, specjvm2008. Most of the time the
 patchset doesn't show any difference (within run-to-run deviation).
 On NPB it recovers the THP regression, but that's probably not enough to make
 a decision.
 
 It would be nice if somebody tested the patchset on another system or
 workload, especially a configuration that shows a regression with
 THP enabled.

If the only workload that gets a benefit is NPB then we have proof that
this is too hardware dependent to be a conclusive result.

It may have been slower by accident: things like the cache
associativity being off by one bit, combined with the implicit coloring
provided to the lowest 512 colors, could hurt more if the cache
associativity is low.

I'm saying this because NPB on a ThinkPad (an Intel CPU, I assume) is the
benchmark that shows the most benefit among all the benchmarks run on that
hardware.

http://www.phoronix.com/scan.php?page=article&item=linux_transparent_hugepages&num=2

I've once seen certain computations that ran much slower with perfect
cache coloring while most others ran much faster with the page
coloring. That doesn't mean page coloring is bad per se. So NPB on that
specific hardware may have been the exception and not the interesting
case, especially considering the effect of cache copying is the
opposite on slightly different hardware.

I think the static_key should be off by default whenever the CPU
L2 cache size is >= the size of the copy (2*HPAGE_PMD_SIZE). Now the
cache does random replacement, so maybe we could also allow cached
copies for twice the size of the copy (L2size >=
4*HPAGE_PMD_SIZE). Current CPUs have caches much larger than 2*2MB...
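
For illustration, a minimal C sketch of such a default heuristic follows
(the helper name is made up; boot_cpu_data.x86_cache_size and
HPAGE_PMD_SIZE are existing kernel symbols, and the series itself exposes
the choice through the vm.clear_huge_page_nocache sysctl rather than
through this exact check):

#include <linux/types.h>	/* bool */
#include <linux/huge_mm.h>	/* HPAGE_PMD_SIZE */
#include <asm/processor.h>	/* boot_cpu_data */

/*
 * Illustrative only: prefer the cache-avoiding clears when the cache
 * cannot comfortably hold the data touched by a huge page copy.  The
 * factor of two over 2*HPAGE_PMD_SIZE accounts for the roughly random
 * replacement policy mentioned above.
 */
static bool clear_huge_page_nocache_wanted(void)
{
	/* x86_cache_size is reported in KB */
	unsigned long cache_bytes = boot_cpu_data.x86_cache_size * 1024UL;

	return cache_bytes < 4 * HPAGE_PMD_SIZE;
}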

It would make a whole lot more sense for hugetlbfs giga pages than for
THP (unlike with THP, cache trashing with giga pages is guaranteed),
but even with giga pages it's not like they're allocated frequently
(maybe once per OS reboot), so that's also surely lost in the
noise, as it only saves a few accesses after the cache copy is
finished.

It's good to have tested it though.

Thanks,
Andrea


Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-13 Thread Andrew Morton
On Mon, 20 Aug 2012 16:52:29 +0300
Kirill A. Shutemov kirill.shute...@linux.intel.com wrote:

 Clearing a 2MB huge page will typically blow away several levels of CPU
 caches.  To avoid this, do a cached clear only on the 4K area around the
 fault address and use cache-avoiding clears for the rest of the 2MB area.
 
 This patchset implements a cache-avoiding version of clear_page only for
 x86. If an architecture wants to provide a cache-avoiding version of
 clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and implement
 clear_page_nocache() and clear_user_highpage_nocache().

Patchset looks nice to me, but the changelogs are terribly short of
performance measurements.  For this sort of change I do think it is
important that pretty exhaustive testing be performed, and that the
results (or a readable summary of them) be shown.  And that testing
should be designed to probe for slowdowns, not just the speedups!




Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-13 Thread Ingo Molnar

* Andrew Morton a...@linux-foundation.org wrote:

 On Mon, 20 Aug 2012 16:52:29 +0300
 Kirill A. Shutemov kirill.shute...@linux.intel.com wrote:
 
  Clearing a 2MB huge page will typically blow away several levels of CPU
  caches.  To avoid this, do a cached clear only on the 4K area around the
  fault address and use cache-avoiding clears for the rest of the 2MB area.
  
  This patchset implements a cache-avoiding version of clear_page only for
  x86. If an architecture wants to provide a cache-avoiding version of
  clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and implement
  clear_page_nocache() and clear_user_highpage_nocache().
 
 Patchset looks nice to me, but the changelogs are terribly 
 short of performance measurements.  For this sort of change I 
 do think it is important that pretty exhaustive testing be 
 performed, and that the results (or a readable summary of 
 them) be shown.  And that testing should be designed to probe 
 for slowdowns, not just the speedups!

That is my general impression as well.

Firstly, doing before/after perf stat --repeat 3 ... runs 
showing a statistically significant effect on a workload that is 
expected to win from this, and on a workload expected to be 
hurt by this, would go a long way towards convincing me.

Secondly, if you can find some user-space simulation of the 
intended positive (and negative) effects, then a 'perf bench' 
testcase designed to show the weaknesses of any such approach, 
running the very kernel assembly code in user-space, would also 
be rather useful.

See:

comet:~/tip> git grep x86 tools/perf/bench/ | grep inclu
tools/perf/bench/mem-memcpy-arch.h:#include "mem-memcpy-x86-64-asm-def.h"
tools/perf/bench/mem-memcpy-x86-64-asm.S:#include "../../../arch/x86/lib/memcpy_64.S"
tools/perf/bench/mem-memcpy.c:#include "mem-memcpy-x86-64-asm-def.h"
tools/perf/bench/mem-memset-arch.h:#include "mem-memset-x86-64-asm-def.h"
tools/perf/bench/mem-memset-x86-64-asm.S:#include "../../../arch/x86/lib/memset_64.S"
tools/perf/bench/mem-memset.c:#include "mem-memset-x86-64-asm-def.h"

that code uses the kernel-side assembly code and runs it in 
user-space.

Although obviously clearing pages on page faults needs some care 
to properly simulate in user-space.
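
As a rough, hypothetical illustration of the kind of user-space simulation
suggested above (this is not the perf-bench code, and every name in it is
made up), the sketch below clears a 2MB buffer with ordinary cached stores
and then with SSE2 non-temporal stores; it only times the clears
themselves, not the follow-on cache-warmth effects the patchset actually
targets:

/* Build (assumption): gcc -O2 -msse2 clear_bench.c -o clear_bench */
#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HUGE_SZ	(2UL * 1024 * 1024)	/* one 2MB huge page worth of data */
#define ITERS	1000

static void clear_cached(void *p, size_t len)
{
	memset(p, 0, len);		/* ordinary, cache-allocating stores */
}

static void clear_nocache(void *p, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	char *c = p;
	size_t i;

	for (i = 0; i < len; i += 16)	/* non-temporal stores bypass the cache */
		_mm_stream_si128((__m128i *)(c + i), zero);
	_mm_sfence();			/* order the streaming stores */
}

static double bench(void (*clear)(void *, size_t), void *buf)
{
	struct timespec a, b;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < ITERS; i++)
		clear(buf, HUGE_SZ);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	void *buf;

	if (posix_memalign(&buf, 4096, HUGE_SZ))
		return 1;

	printf("cached : %.3f s for %d clears\n", bench(clear_cached, buf), ITERS);
	printf("nocache: %.3f s for %d clears\n", bench(clear_nocache, buf), ITERS);

	free(buf);
	return 0;
}

A real testcase along the lines suggested here would instead reuse
arch/x86/lib/clear_page_64.S the way the mem-memcpy/mem-memset benchmarks
reuse the kernel's memcpy/memset assembly.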

Without repeatable hard numbers such code just gets into the 
kernel and bitrots there as new CPU generations come in - a few 
years down the line the original decisions often degrade to pure 
noise. We've been there, we've done that, we don't want to 
repeat it.

Thanks,

Ingo


Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-12 Thread Kirill A. Shutemov
Hi,

Any feedback?

-- 
 Kirill A. Shutemov



[PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-08-20 Thread Kirill A. Shutemov
From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Clearing a 2MB huge page will typically blow away several levels of CPU
caches.  To avoid this, do a cached clear only on the 4K area around the
fault address and use cache-avoiding clears for the rest of the 2MB area.

This patchset implements a cache-avoiding version of clear_page only for
x86. If an architecture wants to provide a cache-avoiding version of
clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().
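
For illustration only, here is a minimal sketch of the resulting clearing
loop (the function name is made up and the details differ from the actual
mm/memory.c changes in this series): the 4K page containing the fault
address is cleared through the cache, so the data the application touches
first is hot, while the rest of the huge page uses the cache-avoiding
clear.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/sched.h>

/*
 * Illustrative helper, not the series' actual code; it assumes the
 * architecture defined ARCH_HAS_USER_NOCACHE and provides
 * clear_user_highpage_nocache().
 */
static void clear_huge_page_around_fault(struct page *page,
					 unsigned long haddr,
					 unsigned long fault_address,
					 unsigned int pages_per_huge_page)
{
	unsigned int i;

	for (i = 0; i < pages_per_huge_page; i++) {
		unsigned long vaddr = haddr + i * PAGE_SIZE;

		cond_resched();
		if (vaddr == (fault_address & PAGE_MASK))
			/* keep the faulting 4K page hot in the cache */
			clear_user_highpage(page + i, vaddr);
		else
			/* don't displace useful cache lines for the rest */
			clear_user_highpage_nocache(page + i, vaddr);
	}
}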

v4:
  - vm.clear_huge_page_nocache sysctl;
  - rework page iteration in clear_{huge,gigantic}_page according to
Andrea Arcangeli suggestion;
v3:
  - Rebased to current Linus' tree. kmap_atomic() build issue is fixed;
  - Pass fault address to clear_huge_page(). v2 had problem with clearing
for sizes other than HPAGE_SIZE;
  - x86: fix 32bit variant. Fallback version of clear_page_nocache() has
been added for non-SSE2 systems;
  - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
  - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
  - No code change. Only commit messages are updated;
  - RFC mark is dropped;

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (3):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()
  mm: implement vm.clear_huge_page_nocache sysctl

 Documentation/sysctl/vm.txt  |   13 ++
 arch/x86/include/asm/page.h  |2 +
 arch/x86/include/asm/string_32.h |5 ++
 arch/x86/include/asm/string_64.h |5 ++
 arch/x86/lib/Makefile|3 +-
 arch/x86/lib/clear_page_32.S |   72 +++
 arch/x86/lib/clear_page_64.S |   78 ++
 arch/x86/mm/fault.c  |7 +++
 include/linux/mm.h   |7 +++-
 kernel/sysctl.c  |   12 ++
 mm/huge_memory.c |   17 
 mm/hugetlb.c |   39 ++-
 mm/memory.c  |   72 ++
 13 files changed, 294 insertions(+), 38 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6
