Here are the results using our internal benchmarks, which are a mix of multi-threaded and single-threaded programs. The data was collected about a month ago, but I did not get time to send it earlier due to an unexpected trip.
cmpxchg gives the worst performance due to the memory barriers it incurs.
I'll send a patch that supports conditional_1 and unconditional_1.

--------------------------------- result -----------------------------

base: original_coverage
(1): using_conditional_1   -- conditional branch (my original implementation)
(2): using_unconditional_1 -- write 1 unconditionally
(3): using_cmpxchg         -- write 1 using cmpxchg

Values are performance ratios where 100.0 equals the performance of the
plain -O2 build. Larger numbers are faster. "--" means the test failed
due to running too slowly.

arch : westmere

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      26.4      +176.62%     +17.20%     --
benchmark_2      --        [78.4]       [12.3]      --
benchmark_3      86.3      +6.15%       +10.52%     -61.28%
benchmark_4      88.4      +6.59%       +14.26%     -68.76%
benchmark_5      89.6      +6.26%       +13.00%     -68.74%
benchmark_6      76.7      +22.28%      +29.15%     -75.31%
benchmark_7      89.0      -0.62%       +3.36%      -71.37%
benchmark_8      84.5      -1.45%       +5.27%      -74.04%
benchmark_9      81.3      +10.64%      +13.32%     -72.82%
benchmark_10     59.1      +44.71%      +14.77%     -73.24%
benchmark_11     90.3      -1.74%       +4.22%      -61.95%
benchmark_12     98.9      +0.07%       +0.48%      -6.37%
benchmark_13     74.0      -4.69%       +4.35%      -77.02%
benchmark_14     21.4      +309.92%     +63.41%     -35.82%
benchmark_15     21.4      +282.33%     +58.15%     -57.98%
benchmark_16     85.1      -7.71%       +1.65%      -60.72%
benchmark_17     81.7      +2.47%       +8.20%      -72.08%
benchmark_18     83.7      +1.59%       +3.83%      -69.33%
geometric mean             +30.30%      +14.41%     -65.66% (incomplete)

arch : sandybridge

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [70.1]       [26.1]      --
benchmark_2      --        [79.1]       --          --
benchmark_3      84.3      +10.82%      +15.84%     -68.98%
benchmark_4      88.5      +10.28%      +11.35%     -75.10%
benchmark_5      89.4      +10.46%      +11.40%     -74.41%
benchmark_6      65.5      +38.52%      +44.46%     -77.97%
benchmark_7      87.7      -0.16%       +1.74%      -76.19%
benchmark_8      89.6      -4.52%       +6.29%      -78.10%
benchmark_9      79.9      +13.43%      +19.44%     -75.99%
benchmark_10     52.6      +61.53%      +8.23%      -78.41%
benchmark_11     89.9      -1.40%       +3.37%      -68.16%
benchmark_12     99.0      +1.51%       +0.63%      -10.37%
benchmark_13     74.3      -6.75%       +3.89%      -81.84%
benchmark_14     21.8      +295.76%     +19.48%     -51.58%
benchmark_15     23.5      +257.20%     +29.33%     -83.53%
benchmark_16     84.4      -10.04%      +2.39%      -68.25%
benchmark_17     81.6      +0.60%       +8.82%      -78.02%
benchmark_18     87.4      -1.14%       +9.69%      -75.88%
geometric mean             +25.64%      +11.76%     -72.96% (incomplete)

arch : clovertown

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [83.4]       --          --
benchmark_2      --        [82.3]       --          --
benchmark_3      86.2      +7.58%       +13.10%     -81.74%
benchmark_4      89.4      +5.69%       +11.70%     -82.97%
benchmark_5      92.8      +4.67%       +7.48%      -80.02%
benchmark_6      78.1      +13.28%      +22.21%     -86.92%
benchmark_7      96.8      +0.25%       +5.44%      -84.94%
benchmark_8      89.1      +0.66%       +3.60%      -85.89%
benchmark_9      86.4      +8.42%       +9.95%      -82.30%
benchmark_10     59.7      +44.95%      +21.79%     --
benchmark_11     91.2      -0.29%       +4.35%      -76.05%
benchmark_12     99.0      +0.31%       -0.05%      -25.19%
benchmark_13     77.1      -3.34%       +5.75%      --
benchmark_14     8.2       +1011.27%    +104.15%    +5.56%
benchmark_15     11.7      +669.25%     +108.54%    -29.83%
benchmark_16     85.7      -7.51%       +4.43%      --
benchmark_17     87.7      +2.84%       +7.45%      --
benchmark_18     87.9      +1.59%       +3.82%      -81.11%
geometric mean             +37.89%      +17.54%     -74.47% (incomplete)

arch : istanbul

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [73.2]       --          --
benchmark_2      --        [82.9]       --          --
benchmark_3      86.1      +4.56%       +11.68%     -61.04%
benchmark_4      92.0      +3.47%       +4.63%      -64.84%
benchmark_5      91.9      +4.18%       +4.90%      -64.77%
benchmark_6      73.6      +23.36%      +27.13%     -72.64%
benchmark_7      93.6      -3.57%       +4.76%      -68.54%
benchmark_8      88.9      -3.01%       +2.87%      -75.50%
benchmark_9      81.6      +9.91%       +14.10%     -69.81%
benchmark_10     69.8      +23.60%      -1.53%      --
benchmark_11     89.9      +0.34%       +4.32%      -59.62%
benchmark_12     98.7      +0.96%       +0.88%      -10.52%
benchmark_13     80.5      -8.31%       -0.55%      -75.90%
benchmark_14     17.1      +429.47%     -5.21%      -4.87%
benchmark_15     22.0      +295.70%     -1.29%      -46.36%
benchmark_16     80.0      -6.31%       +0.54%      -62.61%
benchmark_17     83.5      +4.84%       +10.71%     -70.48%
benchmark_18     90.0      -1.27%       +3.18%      -72.30%
geometric mean             +24.51%      +4.81%      -62.44% (incomplete)
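To make the three variants concrete, here is a minimal C-level sketch of what
each counter update does. The helper names are mine for illustration only; the
actual patch emits the equivalent GIMPLE inline for each instrumented edge
rather than calling helpers.

  /* Stand-in for gcov's 64-bit counter type.  */
  typedef long long gcov_type;

  /* (1) conditional_1: load, branch, plain store.  Once the counter is
     non-zero the cache line is only ever read, so there is no further
     invalidation traffic between threads.  */
  static inline void
  update_conditional_1 (gcov_type *counter)
  {
    if (*counter == 0)
      *counter = 1;
  }

  /* (2) unconditional_1: always store 1.  No branch, but every execution
     dirties the line, so it keeps bouncing between cores in
     multi-threaded runs.  */
  static inline void
  update_unconditional_1 (gcov_type *counter)
  {
    *counter = 1;
  }

  /* (3) cmpxchg: writes only when the counter is still 0, but on x86 a
     lock cmpxchg acts as a full memory barrier on every execution.  */
  static inline void
  update_cmpxchg (gcov_type *counter)
  {
    gcov_type expected = 0;
    __atomic_compare_exchange_n (counter, &expected, 1, /*weak=*/false,
                                 __ATOMIC_RELAXED, __ATOMIC_RELAXED);
  }

Variant (1) is what the proposed patch injects; the column (3) numbers above
are largely the cost of paying that full barrier on every counted edge.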
On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <x...@google.com> wrote:
> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <x...@google.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <x...@google.com> wrote:
>>>>> Hi,
>>>>>
>>>>> This patch injects a condition into the instrumented code for edge
>>>>> counter updates. The counter value will not be updated after reaching
>>>>> the value 1.
>>>>>
>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>> The default is disabled; set it to 1 to enable.
>>>>>
>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>> benchmarks (all single-threaded programs), we usually see around a
>>>>> 20%-35% slowdown in an -O2 coverage build. This feature, however, is
>>>>> expected to improve the coverage run speed for multi-threaded
>>>>> programs, because there is virtually no data race or false sharing
>>>>> when updating counters. The improvement can be significant for highly
>>>>> threaded programs -- we are seeing a 7x speedup in coverage test runs
>>>>> for some non-trivial Google applications.
>>>>>
>>>>> Tested with bootstrap.
>>>>
>>>> Err - why not simply emit
>>>>
>>>>   counter = 1
>>>>
>>>> for the counter update itself with that --param (I don't like a --param
>>>> for this either).
>>>>
>>>> I assume that CPUs can avoid data races and false sharing for
>>>> non-changing accesses?
>>>>
>>>
>>> I'm not aware of any CPU having this feature. I think a write to the
>>> shared cache line invalidates all the shared copies. I cannot find
>>> any reference on checking the value of the write. Do you have any
>>> pointer to the feature?
>>
>> I don't have any pointer - but I remember seeing this in the context
>> of atomics, thus it may be only in the context of using an xchg
>> or cmpxchg instruction. Which would make it non-portable to
>> some extent (if you don't want to use atomic builtins here).
>>
>
> cmpxchg should work here -- it's a conditional write, so the data race /
> false sharing can be avoided.
> I'm comparing the performance of an explicit branch vs. cmpxchg and will
> report back.
>
> -Rong
>
>
>> Richard.
>>
>>> I just tested this implementation vs. simply setting the counter to 1,
>>> using Google search as the benchmark.
>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>
>>> I can change the parameter to an option.
>>>
>>> -Rong
>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>>
>>>>> -Rong
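For completeness, with the patch applied the feature would be enabled in a
coverage build roughly like this (the --param name is the one proposed in the
patch and does not exist in stock GCC; foo.c is a placeholder):

  gcc -O2 --coverage --param=coverage-exec_once=1 foo.c -o foo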