Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
> Without repeatable hard numbers such code just gets into the kernel and
> bitrots there as new CPU generations come in - a few years down the line
> the original decisions often degrade to pure noise. We've been there,
> we've done that, we don't want to repeat it.

Sorry for the late answer.

Hard numbers are hard. I've checked some workloads: Mosbench, NPB,
SPECjvm2008. Most of the time the patchset doesn't show any difference
(within run-to-run deviation). On NPB it recovers the THP regression, but
that's probably not enough to make a decision.

It would be nice if somebody tested the patchset on another system or
workload, especially a configuration that shows a regression with THP
enabled.

--
 Kirill A. Shutemov

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
Hi Kirill,

On Tue, Sep 25, 2012 at 05:27:03PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
> > Without repeatable hard numbers such code just gets into the kernel
> > and bitrots there as new CPU generations come in - a few years down
> > the line the original decisions often degrade to pure noise. We've
> > been there, we've done that, we don't want to repeat it.
>
> Sorry for the late answer.
>
> Hard numbers are hard. I've checked some workloads: Mosbench, NPB,
> SPECjvm2008. Most of the time the patchset doesn't show any difference
> (within run-to-run deviation). On NPB it recovers the THP regression,
> but that's probably not enough to make a decision.
>
> It would be nice if somebody tested the patchset on another system or
> workload, especially a configuration that shows a regression with THP
> enabled.

If the only workload that gets a benefit is NPB then we have proof this is
too hardware dependent to be a conclusive result. It may have been slower
by accident: things like cache associativity off by one bit, combined with
the implicit coloring provided to the lowest 512 colors, could hurt more
if the cache associativity is low.

I'm saying this because NPB on a ThinkPad (Intel CPU I assume) is the
benchmark that shows the most benefit among all benchmarks run on that
hardware:

http://www.phoronix.com/scan.php?page=article&item=linux_transparent_hugepages&num=2

I've once seen certain computations run much slower with perfect cache
coloring while most others ran much faster with page coloring. That
doesn't mean page coloring is bad per se. So NPB on that specific hardware
may have been the exception and not the interesting case, especially
considering the effect of cache-avoiding copies is the opposite on
slightly different hardware.

I think the static_key should be off by default whenever the CPU L2 cache
size is >= the size of the copy (2*HPAGE_PMD_SIZE). Now the cache does
random replacement, so maybe we could also allow cache copies for twice
the size of the copy (L2size >= 4*HPAGE_PMD_SIZE). Current CPUs have
caches much larger than 2*2MB...

It would make a whole lot more sense for hugetlbfs giga pages than for THP
(unlike with THP, cache trashing with giga pages is guaranteed), but even
with giga pages it's not like they're allocated frequently (maybe once per
OS reboot), so that's also surely lost in the noise, as it only saves a
few accesses after the cache copy is finished.

It's good to have tested it though.

Thanks,
Andrea
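As an aside, the threshold Andrea describes could be sketched in userspace
like this. This is only an illustration of the decision rule, not code from
the patchset; `want_nocache_clear()` and the relaxed variant are hypothetical
names, and the L2 size would come from CPUID in a real implementation:

```c
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_SIZE (2UL << 20)      /* 2MB transparent huge page */

/* Strict rule: use the cache-avoiding clear only when the L2 cache is
 * too small to hold the data touched by the clear (2*HPAGE_PMD_SIZE). */
static bool want_nocache_clear(size_t l2_size)
{
    return l2_size < 2 * HPAGE_PMD_SIZE;
}

/* Relaxed rule: since the cache does (near-)random replacement, allow
 * cached clears up to twice that footprint before going non-temporal. */
static bool want_nocache_clear_relaxed(size_t l2_size)
{
    return l2_size < 4 * HPAGE_PMD_SIZE;
}
```

With a 1MB L2 both rules pick the non-temporal clear; with an 8MB L2
neither does, matching Andrea's point that current CPUs have caches much
larger than 2*2MB.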
Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
On Mon, 20 Aug 2012 16:52:29 +0300 Kirill A. Shutemov
<kirill.shute...@linux.intel.com> wrote:
> Clearing a 2MB huge page will typically blow away several levels of CPU
> caches. To avoid this, only cache-clear the 4K area around the fault
> address and use cache-avoiding clears for the rest of the 2MB area.
>
> This patchset implements a cache-avoiding version of clear_page only for
> x86. If an architecture wants to provide a cache-avoiding version of
> clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and implement
> clear_page_nocache() and clear_user_highpage_nocache().

The patchset looks nice to me, but the changelogs are terribly short of
performance measurements. For this sort of change I do think it is
important that pretty exhaustive testing be performed, and that the
results (or a readable summary of them) be shown. And that testing should
be designed to probe for slowdowns, not just the speedups!
Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
* Andrew Morton <a...@linux-foundation.org> wrote:
> On Mon, 20 Aug 2012 16:52:29 +0300 Kirill A. Shutemov
> <kirill.shute...@linux.intel.com> wrote:
> > Clearing a 2MB huge page will typically blow away several levels of
> > CPU caches. To avoid this, only cache-clear the 4K area around the
> > fault address and use cache-avoiding clears for the rest of the 2MB
> > area.
> >
> > This patchset implements a cache-avoiding version of clear_page only
> > for x86. If an architecture wants to provide a cache-avoiding version
> > of clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and
> > implement clear_page_nocache() and clear_user_highpage_nocache().
>
> The patchset looks nice to me, but the changelogs are terribly short of
> performance measurements. For this sort of change I do think it is
> important that pretty exhaustive testing be performed, and that the
> results (or a readable summary of them) be shown. And that testing
> should be designed to probe for slowdowns, not just the speedups!

That is my general impression as well.

Firstly, doing before/after "perf stat --repeat 3 ..." runs showing a
statistically significant effect on a workload that is expected to win
from this, and on a workload expected to be hurting from this, would go a
long way towards convincing me.

Secondly, if you can find some user-space simulation of the intended
positive (and negative) effects then a 'perf bench' testcase designed to
show the weakness of any such approach, running the very kernel assembly
code in user-space, would also be rather useful. See:

  comet:~/tip> git grep x86 tools/perf/bench/ | grep inclu
  tools/perf/bench/mem-memcpy-arch.h:#include "mem-memcpy-x86-64-asm-def.h"
  tools/perf/bench/mem-memcpy-x86-64-asm.S:#include "../../../arch/x86/lib/memcpy_64.S"
  tools/perf/bench/mem-memcpy.c:#include "mem-memcpy-x86-64-asm-def.h"
  tools/perf/bench/mem-memset-arch.h:#include "mem-memset-x86-64-asm-def.h"
  tools/perf/bench/mem-memset-x86-64-asm.S:#include "../../../arch/x86/lib/memset_64.S"
  tools/perf/bench/mem-memset.c:#include "mem-memset-x86-64-asm-def.h"

That code uses the kernel-side assembly code and runs it in user-space,
although obviously clearing pages on page faults needs some care to
properly simulate in user-space.

Without repeatable hard numbers such code just gets into the kernel and
bitrots there as new CPU generations come in - a few years down the line
the original decisions often degrade to pure noise. We've been there,
we've done that, we don't want to repeat it.

Thanks,
Ingo
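The user-space simulation Ingo suggests doesn't strictly need the perf
harness; the core of a cache-avoiding clear can be approximated directly
with SSE2 non-temporal stores. This is a hypothetical standalone sketch
(x86 with SSE2 assumed; it is not the patchset's actual assembly, which
lives in clear_page_{32,64}.S):

```c
#include <emmintrin.h>          /* SSE2 intrinsics: _mm_stream_si128 */
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Cache-avoiding clear of one 4K page using non-temporal stores,
 * roughly what a clear_page_nocache() implementation does: the stores
 * bypass the cache hierarchy instead of evicting live cache lines.
 * The page must be 16-byte aligned for _mm_stream_si128. */
static void clear_page_nocache_sim(void *page)
{
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)page;
    size_t i;

    for (i = 0; i < PAGE_SIZE / sizeof(__m128i); i++)
        _mm_stream_si128(&p[i], zero);

    _mm_sfence();               /* order the non-temporal stores */
}
```

Timing this against a plain memset() over buffers larger than the L2
cache is one way to reproduce the cache-pressure tradeoff in user-space,
though as Ingo notes, the page-fault context is hard to simulate
faithfully.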
Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
Hi,

Any feedback?

--
 Kirill A. Shutemov
[PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
From: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

Clearing a 2MB huge page will typically blow away several levels of CPU
caches. To avoid this, only cache-clear the 4K area around the fault
address and use cache-avoiding clears for the rest of the 2MB area.

This patchset implements a cache-avoiding version of clear_page only for
x86. If an architecture wants to provide a cache-avoiding version of
clear_page it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

v4:
 - vm.clear_huge_page_nocache sysctl;
 - rework page iteration in clear_{huge,gigantic}_page according to
   Andrea Arcangeli's suggestion;
v3:
 - Rebased to current Linus' tree. kmap_atomic() build issue is fixed;
 - Pass fault address to clear_huge_page(). v2 had a problem with
   clearing for sizes other than HPAGE_SIZE;
 - x86: fix 32bit variant. A fallback version of clear_page_nocache()
   has been added for non-SSE2 systems;
 - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
 - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
 - No code change. Only commit messages are updated;
 - RFC mark is dropped;

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (3):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()
  mm: implement vm.clear_huge_page_nocache sysctl

 Documentation/sysctl/vm.txt      | 13 ++
 arch/x86/include/asm/page.h      |  2 +
 arch/x86/include/asm/string_32.h |  5 ++
 arch/x86/include/asm/string_64.h |  5 ++
 arch/x86/lib/Makefile            |  3 +-
 arch/x86/lib/clear_page_32.S     | 72 +++
 arch/x86/lib/clear_page_64.S     | 78 ++
 arch/x86/mm/fault.c              |  7 +++
 include/linux/mm.h               |  7 +++-
 kernel/sysctl.c                  | 12 ++
 mm/huge_memory.c                 | 17 
 mm/hugetlb.c                     | 39 ++-
 mm/memory.c                      | 72 ++
 13 files changed, 294 insertions(+), 38 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

--
1.7.7.6
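The core idea of the cover letter can be sketched in user space as follows.
This is a simplified illustration, not the patchset's code: memset() stands
in for both the cached and the cache-avoiding clear, the helper names are
hypothetical, and the real v4 iteration order (reworked per Andrea's
suggestion) differs; here the fault subpage is simply cleared last, with the
cached variant, so its lines are hot when the faulting thread resumes:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE  4096UL
#define HPAGE_SIZE (2UL << 20)              /* 2MB huge page */
#define HPAGE_NR   (HPAGE_SIZE / PAGE_SIZE) /* 512 subpages */

/* Stand-ins for the kernel's cached and cache-avoiding clears. */
static void clear_page_cached(void *p)  { memset(p, 0, PAGE_SIZE); }
static void clear_page_nocache(void *p) { memset(p, 0, PAGE_SIZE); }

/*
 * Clear every 4K subpage of a 2MB page with the cache-avoiding
 * variant, except the subpage containing the fault address, which is
 * cleared last with the ordinary cached clear.  That way the bulk of
 * the 2MB clear does not evict live cache lines, while the data the
 * faulting access touches next is already in cache.
 */
static void clear_huge_page_sim(char *huge, unsigned long fault_addr)
{
    unsigned long target = (fault_addr / PAGE_SIZE) % HPAGE_NR;
    unsigned long i;

    for (i = 0; i < HPAGE_NR; i++) {
        if (i == target)
            continue;
        clear_page_nocache(huge + i * PAGE_SIZE);
    }
    clear_page_cached(huge + target * PAGE_SIZE);
}
```

This is also why the fault address has to be threaded down through
hugetlb_no_page() and clear_huge_page(), as patches 1, 2, and 6-7 do:
without it, the clearing code cannot tell which subpage to keep cached.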