Changelog since v1
o Drop a pagetable walk that seems redundant
o Account for TLB flushes only when debugging
o Drop the patch that took the number of CPUs to flush into account
ebizzy regressed between 3.4 and 3.10 while testing on a new machine. Bisection initially found at least three problems, the first of which was commit 611ae8e3 (x86/tlb: enable tlb flush range support for x86). The second was related to TLB flush accounting. The third was related to ACPI cpufreq, which was disabled for the purposes of this series.

The intent of the TLB range flush series was to preserve existing TLB entries by flushing a range one page at a time instead of flushing the whole address space. This makes a certain amount of sense if the address space being flushed is known to have existing hot entries. The decision on whether to do a full mm flush or a number of single-page flushes depends on the size of the relevant TLB and on how many of those hot entries a targeted flush would preserve (a rough sketch of the heuristic is included below). This implicitly assumes a lot, including the following

o That the full TLB is in use by the task being flushed
o The TLB has hot entries that are going to be used in the near future
o The TLB has entries for the range being flushed
o The cost of the per-page flushes is similar to a single mm flush
o Large pages are unimportant and can always be globally flushed
o Small flushes from workloads are very common

The first three are completely unknowable, but unfortunately they are probably true of micro-benchmarks designed to exercise these paths. The fourth depends completely on the hardware. I've no idea what the logic behind the large page decision was, but it's certainly wrong if automatic NUMA balancing is enabled as that frequently flushes a single THP page. The last one is the strangest because generally only a process mapping/unmapping very small regions would hit it. It's possible this is the common case for virtualised workloads managing the address space of their guests. Maybe that was the real original motivation of the TLB range flush support for x86.

Whatever the reason, ebizzy sees very little benefit as it discards newly allocated memory very quickly and it regressed badly on IvyBridge, where it constantly flushes ranges of 128 pages one page at a time. Earlier machines may not have seen this problem as the balance point was at a different location. While I'm wary of optimising for such a benchmark, it's commonly tested and it's apparent that the worst-case defaults for IvyBridge need to be re-examined.

The following small series restores ebizzy to 3.4-era performance for the very limited set of machines tested.
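For illustration, the balance-point heuristic looks roughly like the user-space sketch below. The names (tlb_entries_4k, tlb_flushall_shift, flush_range) and the constants are illustrative only and are not taken verbatim from the kernel sources.

/*
 * Simplified sketch of the full-flush vs per-page-flush decision.
 * Compile with: gcc -Wall sketch.c -o sketch
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Assumed last-level TLB capacity for 4K pages (hypothetical value). */
static unsigned long tlb_entries_4k = 512;

/*
 * Per-family balance point: flush the whole TLB once the range covers
 * more than (active entries >> shift) pages.  A negative value means
 * "always do a full flush".
 */
static int tlb_flushall_shift = 1;

static void flush_tlb_single(unsigned long addr)
{
	printf("  flush single page at %#lx\n", addr);
}

static void flush_tlb_all_local(void)
{
	printf("  full TLB flush\n");
}

/* total_vm is the task's mapped pages, used to cap the "active" estimate. */
static void flush_range(unsigned long total_vm,
			unsigned long start, unsigned long end)
{
	unsigned long act_entries, nr_pages = (end - start) >> PAGE_SHIFT;
	unsigned long addr;

	/* Assume this task could occupy at most the whole TLB. */
	act_entries = total_vm < tlb_entries_4k ? total_vm : tlb_entries_4k;

	if (tlb_flushall_shift < 0 ||
	    nr_pages > (act_entries >> tlb_flushall_shift)) {
		flush_tlb_all_local();
		return;
	}

	/* Below the balance point: flush one page at a time. */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		flush_tlb_single(addr);
}

int main(void)
{
	printf("small range (8 pages):\n");
	flush_range(4096, 0x400000, 0x400000 + 8 * PAGE_SIZE);

	printf("large range (512 pages):\n");
	flush_range(4096, 0x400000, 0x400000 + 512 * PAGE_SIZE);
	return 0;
}

The key point for this series is that a workload like ebizzy, which repeatedly unmaps ranges that sit just below the balance point, ends up in the per-page loop even though the entries it would "preserve" are about to be discarded anyway.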
ebizzy
                       3.13.0-rc3                3.4.69            3.13.0-rc3            3.13.0-rc3
thread                    vanilla               vanilla       altershift-v2r1           nowalk-v2r7
Mean   1     7377.91 (  0.00%)     6812.38 ( -7.67%)     7784.45 (  5.51%)     7804.08 (  5.78%)
Mean   2     8262.07 (  0.00%)     8276.75 (  0.18%)     9437.49 ( 14.23%)     9450.88 ( 14.39%)
Mean   3     7895.00 (  0.00%)     8002.84 (  1.37%)     8875.38 ( 12.42%)     8914.60 ( 12.91%)
Mean   4     7658.74 (  0.00%)     7824.83 (  2.17%)     8509.10 ( 11.10%)     8399.43 (  9.67%)
Mean   5     7275.37 (  0.00%)     7678.74 (  5.54%)     8208.94 ( 12.83%)     8197.86 ( 12.68%)
Mean   6     6875.50 (  0.00%)     7597.18 ( 10.50%)     7755.66 ( 12.80%)     7807.51 ( 13.56%)
Mean   7     6722.48 (  0.00%)     7584.75 ( 12.83%)     7456.93 ( 10.93%)     7480.74 ( 11.28%)
Mean   8     6559.55 (  0.00%)     7591.51 ( 15.73%)     6879.01 (  4.87%)     6881.86 (  4.91%)
Stddev 1       50.55 (  0.00%)       78.05 (-54.41%)       44.70 ( 11.58%)       39.22 ( 22.41%)
Stddev 2       37.98 (  0.00%)      176.92 (-365.76%)      92.40 (-143.26%)     184.32 (-385.24%)
Stddev 3       55.76 (  0.00%)      126.02 (-126.00%)      99.79 ( -78.95%)      32.97 ( 40.87%)
Stddev 4       64.64 (  0.00%)      117.09 ( -81.13%)     124.23 ( -92.17%)     212.67 (-229.00%)
Stddev 5      131.53 (  0.00%)       92.86 (  29.39%)     108.07 ( 17.83%)      101.05 ( 23.17%)
Stddev 6      109.92 (  0.00%)       74.87 (  31.89%)     179.26 ( -63.08%)     202.56 ( -84.28%)
Stddev 7      124.32 (  0.00%)       72.25 (  41.88%)     124.46 (  -0.12%)     128.52 (  -3.38%)
Stddev 8       60.98 (  0.00%)       60.98 (  -0.00%)      62.31 (  -2.19%)      63.73 (  -4.51%)

The test machine was single socket and the number of threads tested ranged from 1 to NR_CPUS. For each thread count there were 100 iterations, and the reported mean and stddev are based on those iterations. The results are unfortunately noisy, but many of the gains are well outside 1 standard deviation. The test is dominated by address space allocation, page allocation and zeroing of the pages, with the flush being a relatively small component of the workload.

It was suggested that the per-family TLB shifts be removed entirely, but the figures must have been based on some testing by someone somewhere using a representative workload. Details on that would be nice but, in the meantime, I only altered IvyBridge as its balance point happens to be where ebizzy becomes an adverse workload.

 arch/x86/include/asm/tlbflush.h    |  6 ++---
 arch/x86/kernel/cpu/intel.c        |  2 +-
 arch/x86/kernel/cpu/mtrr/generic.c |  4 +--
 arch/x86/mm/tlb.c                  | 52 ++++++++++----------------------------
 include/linux/vm_event_item.h      |  4 +--
 include/linux/vmstat.h             |  8 ++++++
 6 files changed, 29 insertions(+), 47 deletions(-)

-- 
1.8.4