Xionghu, thanks for working on fixing the exchange regression.

luoxhu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On 2020/8/13 20:52, Jan Hubicka wrote:
>>> Since there are no other callers outside of these specialized nodes, the
>>> guessed profile count should be same equal?  Perf tool shows that even
>>> each specialized node is called only once, none of them take same time for
>>> each call:
>>>
>>>    40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
>>>    16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
>>>    10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
>>>     5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
>>>     4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>>>     3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
>>>     1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
>>>     0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
>>>     0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
>>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
>>>     0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
>>>     0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
>>>     0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1
>>>
>>>
>>> digits_2.constprop.4 and digits_2.constprop.3 take most of the execution
>>> time, so profile count and frequency seem not very helpful for this case?
>> Yep, you cannot really determine the time spent on each of the recursion
>> levels from the recursive edge probability, since you cannot assume it
>> to be the same on each level of recursion (here we know it is 0 at level
>> 10, and it slowly drops as the recursion deepens because there
>> are fewer possibilities to fill up the partly filled sudoku :).
>> Especially you cannot assume it after ipa-cp has done the work to figure
>> out that there is a recursion depth counter that affects the function.
>> 
>> Thinking of the ipa-cp profile-updating changes, I did not consider the
>> approach of adding extra weight to the recursive call safe.  The problem
>> is that the recursive call frequency estimate, when coming from static
>> profile estimation, is likely completely off, just as the static profile
>> estimator cannot be used to realistically predict the number of
>> iterations of loops.  However, even with profile feedback this is hard to
>> interpret.
>> 
>> I was wondering if we could not simply detect this scenario and avoid
>> using the recursive edge probability in the profile updating.
>> We could simply scale by the number of copies.
>> This means that if we produce 10 clones, we could just set each clone to
>> 1/10th of the frequency.  This implies that the hottest spot will not be
>> underestimated exponentially, as can happen now, but only 10 times
>> at worst, which is probably still OK.  We play similar games in loop
>> optimizers: static profile estimators expect every loop to iterate around
>> 3 times, so the unroller/vectorizer/etc. would make no sense on them.
>> 
>> By scaling down according to the number of copies we will keep the
>> frequencies of calls to functions called from clones "sane".  We will
>> still run into problems when we inline one clone into another (since its
>> profile will get scaled by the recursive edge frequency), but perhaps
>> that can be special-cased in the profiler, or ipa-cp can be made to
>> adjust the recursive edges to get a frequency close to 1 as a hack.
>> 
>> This is not pretty, but at least it looks safer to me than the original
>> profile updating patch adding extra weight to the frequency.
>
>
> I really appreciate your detailed explanation.  BTW, my previous patch
> for the PGO build of exchange2 took this same approach several months ago,
> setting each cloned node to 1/10th of the frequency :)
>
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html
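
For reference, my reading of the "scale by number of copies" idea quoted
above is roughly the following (a minimal sketch with hypothetical helper
names; the real implementation would of course operate on cgraph nodes and
profile_count values, not plain integers):

```python
def scale_clone_counts(original_count, num_clones):
    """Assign each of the num_clones specialized clones an equal
    1/num_clones share of the original node's profile count,
    instead of deriving per-clone counts from the (unreliable)
    recursive edge probability."""
    # Worst case, the hottest clone is underestimated by a factor
    # of num_clones -- not exponentially, as with per-level scaling.
    return original_count // num_clones

# e.g. 10 clones of a node with count 1000 each get count 100
per_clone = [scale_clone_counts(1000, 10) for _ in range(10)]
```

The appeal, as I understand it, is that the error is bounded by the clone
count rather than compounding at each recursion level.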

Does it seem likely that we'll reach a resolution on this soon?
I take the point that the patch that introduced the exchange regression
[https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
was just uncovering a latent issue, but still, this is a large regression
in an important benchmark to be carrying around.  For those of us doing
regular benchmark CI, the longer the performance trough gets, the harder
it is to spot other unrelated regressions in the “properly optimised” code.

So if we don't have a specific plan for fixing the regression soon,
I think we should consider reverting the patch until we have something
that avoids the exchange regression, even though the patch wasn't
technically wrong.

Thanks,
Richard
