Here are the results using our internal benchmarks, which are a mix of multi-threaded and single-threaded programs. The data was collected about a month ago, but I did not get time to send it earlier due to an unexpected trip.
cmpxchg gives the worst performance due to the memory barriers it incurs.
I'll send a patch that supports conditional_1 and unconditional_1.

--------------------------------- result -----------------------------

base: original_coverage
(1): using_conditional_1   -- conditional branch (my original implementation)
(2): using_unconditional_1 -- write 1 unconditionally
(3): using_cmpxchg         -- write 1 using cmpxchg

Values are performance ratios where 100.0 equals the performance of the
plain -O2 build. Larger numbers are faster. "--" means the test failed
due to running too slowly.

arch : westmere

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      26.4      +176.62%     +17.20%     --
benchmark_2      --        [78.4]       [12.3]      --
benchmark_3      86.3      +6.15%       +10.52%     -61.28%
benchmark_4      88.4      +6.59%       +14.26%     -68.76%
benchmark_5      89.6      +6.26%       +13.00%     -68.74%
benchmark_6      76.7      +22.28%      +29.15%     -75.31%
benchmark_7      89.0      -0.62%       +3.36%      -71.37%
benchmark_8      84.5      -1.45%       +5.27%      -74.04%
benchmark_9      81.3      +10.64%      +13.32%     -72.82%
benchmark_10     59.1      +44.71%      +14.77%     -73.24%
benchmark_11     90.3      -1.74%       +4.22%      -61.95%
benchmark_12     98.9      +0.07%       +0.48%      -6.37%
benchmark_13     74.0      -4.69%       +4.35%      -77.02%
benchmark_14     21.4      +309.92%     +63.41%     -35.82%
benchmark_15     21.4      +282.33%     +58.15%     -57.98%
benchmark_16     85.1      -7.71%       +1.65%      -60.72%
benchmark_17     81.7      +2.47%       +8.20%      -72.08%
benchmark_18     83.7      +1.59%       +3.83%      -69.33%
geometric mean             +30.30%      +14.41%     -65.66% (incomplete)

arch : sandybridge

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [70.1]       [26.1]      --
benchmark_2      --        [79.1]       --          --
benchmark_3      84.3      +10.82%      +15.84%     -68.98%
benchmark_4      88.5      +10.28%      +11.35%     -75.10%
benchmark_5      89.4      +10.46%      +11.40%     -74.41%
benchmark_6      65.5      +38.52%      +44.46%     -77.97%
benchmark_7      87.7      -0.16%       +1.74%      -76.19%
benchmark_8      89.6      -4.52%       +6.29%      -78.10%
benchmark_9      79.9      +13.43%      +19.44%     -75.99%
benchmark_10     52.6      +61.53%      +8.23%      -78.41%
benchmark_11     89.9      -1.40%       +3.37%      -68.16%
benchmark_12     99.0      +1.51%       +0.63%      -10.37%
benchmark_13     74.3      -6.75%       +3.89%      -81.84%
benchmark_14     21.8      +295.76%     +19.48%     -51.58%
benchmark_15     23.5      +257.20%     +29.33%     -83.53%
benchmark_16     84.4      -10.04%      +2.39%      -68.25%
benchmark_17     81.6      +0.60%       +8.82%      -78.02%
benchmark_18     87.4      -1.14%       +9.69%      -75.88%
geometric mean             +25.64%      +11.76%     -72.96% (incomplete)

arch : clovertown

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [83.4]       --          --
benchmark_2      --        [82.3]       --          --
benchmark_3      86.2      +7.58%       +13.10%     -81.74%
benchmark_4      89.4      +5.69%       +11.70%     -82.97%
benchmark_5      92.8      +4.67%       +7.48%      -80.02%
benchmark_6      78.1      +13.28%      +22.21%     -86.92%
benchmark_7      96.8      +0.25%       +5.44%      -84.94%
benchmark_8      89.1      +0.66%       +3.60%      -85.89%
benchmark_9      86.4      +8.42%       +9.95%      -82.30%
benchmark_10     59.7      +44.95%      +21.79%     --
benchmark_11     91.2      -0.29%       +4.35%      -76.05%
benchmark_12     99.0      +0.31%       -0.05%      -25.19%
benchmark_13     77.1      -3.34%       +5.75%      --
benchmark_14     8.2       +1011.27%    +104.15%    +5.56%
benchmark_15     11.7      +669.25%     +108.54%    -29.83%
benchmark_16     85.7      -7.51%       +4.43%      --
benchmark_17     87.7      +2.84%       +7.45%      --
benchmark_18     87.9      +1.59%       +3.82%      -81.11%
geometric mean             +37.89%      +17.54%     -74.47% (incomplete)

arch : istanbul

Benchmark        Base      (1)          (2)         (3)
--------------------------------------------------------
benchmark_1      --        [73.2]       --          --
benchmark_2      --        [82.9]       --          --
benchmark_3      86.1      +4.56%       +11.68%     -61.04%
benchmark_4      92.0      +3.47%       +4.63%      -64.84%
benchmark_5      91.9      +4.18%       +4.90%      -64.77%
benchmark_6      73.6      +23.36%      +27.13%     -72.64%
benchmark_7      93.6      -3.57%       +4.76%      -68.54%
benchmark_8      88.9      -3.01%       +2.87%      -75.50%
benchmark_9      81.6      +9.91%       +14.10%     -69.81%
benchmark_10     69.8      +23.60%      -1.53%      --
benchmark_11     89.9      +0.34%       +4.32%      -59.62%
benchmark_12     98.7      +0.96%       +0.88%      -10.52%
benchmark_13     80.5      -8.31%       -0.55%      -75.90%
benchmark_14     17.1      +429.47%     -5.21%      -4.87%
benchmark_15     22.0      +295.70%     -1.29%      -46.36%
benchmark_16     80.0      -6.31%       +0.54%      -62.61%
benchmark_17     83.5      +4.84%       +10.71%     -70.48%
benchmark_18     90.0      -1.27%       +3.18%      -72.30%
geometric mean             +24.51%      +4.81%      -62.44% (incomplete)
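To make the three variants concrete, here is a minimal C-level sketch of what
each counter update does. The helper names are mine for illustration only; the
actual patch emits the equivalent GIMPLE inline for each instrumented edge
rather than calling helpers.

  /* Stand-in for gcov's 64-bit counter type.  */
  typedef long long gcov_type;

  /* (1) conditional_1: load, branch, plain store.  Once the counter is
     non-zero the cache line is only ever read, so there is no further
     invalidation traffic between threads.  */
  static inline void
  update_conditional_1 (gcov_type *counter)
  {
    if (*counter == 0)
      *counter = 1;
  }

  /* (2) unconditional_1: always store 1.  No branch, but every execution
     dirties the line, so it keeps bouncing between cores in
     multi-threaded runs.  */
  static inline void
  update_unconditional_1 (gcov_type *counter)
  {
    *counter = 1;
  }

  /* (3) cmpxchg: writes only when the counter is still 0, but on x86 a
     lock cmpxchg acts as a full memory barrier on every execution.  */
  static inline void
  update_cmpxchg (gcov_type *counter)
  {
    gcov_type expected = 0;
    __atomic_compare_exchange_n (counter, &expected, 1, /*weak=*/false,
                                 __ATOMIC_RELAXED, __ATOMIC_RELAXED);
  }

Variant (1) is what the proposed patch injects; the column (3) numbers above
are largely the cost of paying that full barrier on every counted edge.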
On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <x...@google.com> wrote:
> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <x...@google.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <x...@google.com> wrote:
>>>>> Hi,
>>>>>
>>>>> This patch injects a condition into the instrumented code for edge
>>>>> counter updates. The counter value will not be updated after reaching
>>>>> the value 1.
>>>>>
>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>> The default is disabled; set it to 1 to enable.
>>>>>
>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>> benchmarks (all single-threaded programs), we usually see around a
>>>>> 20%-35% slowdown in an -O2 coverage build. This feature, however, is
>>>>> expected to improve the coverage run speed for multi-threaded
>>>>> programs, because there is virtually no data race or false sharing
>>>>> when updating counters. The improvement can be significant for highly
>>>>> threaded programs -- we are seeing a 7x speedup in coverage test runs
>>>>> for some non-trivial Google applications.
>>>>>
>>>>> Tested with bootstrap.
>>>>
>>>> Err - why not simply emit
>>>>
>>>>   counter = 1
>>>>
>>>> for the counter update itself with that --param (I don't like a --param
>>>> for this either).
>>>>
>>>> I assume that CPUs can avoid data races and false sharing for
>>>> non-changing accesses?
>>>>
>>>
>>> I'm not aware of any CPU having this feature. I think a write to the
>>> shared cache line invalidates all the shared copies. I cannot find
>>> any reference on checking the value of the write. Do you have any
>>> pointer to the feature?
>>
>> I don't have any pointer - but I remember seeing this in the context
>> of atomics, thus it may be only in the context of using an xchg
>> or cmpxchg instruction. Which would make it non-portable to
>> some extent (if you don't want to use atomic builtins here).
>>
>
> cmpxchg should work here -- it's a conditional write, so the data race /
> false sharing can be avoided.
> I'm comparing the performance of an explicit branch vs. cmpxchg and will
> report back.
>
> -Rong
>
>
>> Richard.
>>
>>> I just tested this implementation vs. simply setting the counter to 1,
>>> using Google search as the benchmark.
>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>
>>> I can change the parameter to an option.
>>>
>>> -Rong
>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>>
>>>>> -Rong
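For completeness, with the patch applied the feature would be enabled in a
coverage build roughly like this (the --param name is the one proposed in the
patch and does not exist in stock GCC; foo.c is a placeholder):

  gcc -O2 --coverage --param=coverage-exec_once=1 foo.c -o foo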