On 02/27/2014 06:03 PM, Richard Sandiford wrote:
Yury Gribov <y.gri...@samsung.com> writes:
Richard Biener wrote:
If this behavior is not intended, what would be the best way to fix
performance? I could teach GCC to not remove constant RTXs in
flush_hash_table() but this is probably very naive and won't cover some
corner-cases.

That could be a good starting point though.

Though by modifying the "machine state" you can modify constants as well, no?

A valid point, but this would mean relying on the compiler to always load
all constants from memory (instead of, say, generating them via
movhi/movlo) for that piece of code, which looks extremely fragile.

Right.  And constant rtx codes have mode-independent semantics.
(const_int 1) is always 1, whatever a volatile asm does.  Same for
const_double, symbol_ref, label_ref, etc.  If a constant load is implemented
using some mode-dependent operation then it would need to be represented
as something like an unspec instead.  But even then, the result would
usually be annotated with a REG_EQUAL note giving the value of the final
register result.  It should be perfectly OK to reuse that register after
a volatile asm if the value in the REG_EQUAL note is needed again.
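
A minimal sketch of the situation described above (the constant, the function
name and the asm text are made up for illustration): a constant that a target
might materialise with a mode-dependent sequence such as movhi/movlo is used on
both sides of a volatile asm, and reusing the already-built register for the
second store is fine because the constant's value cannot change.

   void store_constant_twice (unsigned int *p)
   {
     p[0] = 0x12345678;   /* constant might be built with a movhi/movlo pair */
     asm volatile ("# volatile asm, no clobbers");
     p[1] = 0x12345678;   /* same value: reusing the materialised register is OK */
   }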

What is the general attitude towards volatile asm? Are people interested
in making it more defined/performant or should we just leave this can of
worms as is? I can try to improve generated code but my patches will be
doomed if there is no consensus on what volatile asm actually means...

I think part of the problem is that some parts of GCC (like the one you
noted) are far more conservative than others.  E.g. take:

   void foo (int x, int *y)
   {
     y[0] = x + 1;
     asm volatile ("# asm");
     y[1] = x + 1;
   }

The extra-paranoid check you pointed out means that we assume that
x + 1 is no longer available after the asm for rtx-level CSE, but take
the opposite view for tree-level CSE, which happily optimises away the
second +.
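
In source form, the effect of tree-level CSE on foo is roughly this (just a
sketch, not actual compiler output; foo_after_tree_cse and tmp are made-up
names):

   void foo_after_tree_cse (int x, int *y)
   {
     int tmp = x + 1;        /* computed once, before the asm */
     y[0] = tmp;
     asm volatile ("# asm");
     y[1] = tmp;             /* the second x + 1 has been optimised away */
   }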

Some places were (maybe still are) worried that volatile asms could
clobber any register they like.  But the register allocator assumes that
registers are preserved across volatile asms unless explicitly clobbered.
And AFAIK it always has.  So in the above example we get:

         addl    $1, %edi
         movl    %edi, (%rsi)
#APP
# 4 "/tmp/foo.c" 1
         # asm
# 0 "" 2
#NO_APP
         movl    %edi, 4(%rsi)
         ret

with %edi being live across the asm.

We do nothing this draconian for a normal function call, which could
easily use a volatile asm internally.  IMO anything that isn't flushed
for a call shouldn't be flushed for a volatile asm either.
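
A minimal sketch of the analogous call case (bar is just a placeholder for any
ordinary external function): nothing forces x + 1 to be flushed across the
call, even though bar could contain a volatile asm internally.

   extern void bar (void);

   void foo_call (int x, int *y)
   {
     y[0] = x + 1;
     bar ();          /* could use a volatile asm internally */
     y[1] = x + 1;    /* x + 1 is still available after the call */
   }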

Disagree here.

Just take the example I gave in my other mail, worked out a bit here:

extern double __attribute__((const)) costly_func (double);
extern void func (void);

extern double X;

void code1_func (double x)
{
     func ();
     X = costly_func (x);
}

void code2_func (double x)
{
     double temp_1 = costly_func (x);
     func ();
     X = temp_1;
}

void code1_asm (double x)
{
     asm volatile ("disable interrupts":::"memory");
     // some atomic operations not involving x or X or costly_func
     asm volatile ("re-enable interrupts":::"memory");

     X = costly_func (x);
}

void code2_asm (double x)
{
     asm volatile ("disable interrupts":::"memory");
     double temp_1 = costly_func (x);
     // some atomic operations not involving x or X or costly_func
     asm volatile ("re-enable interrupts":::"memory");

     X = temp_1;
}

costly_func can be moved across func as usual, i.e. the transformation from code1_func to code2_func is all right.

But this is no longer the case if we move costly_func across an asm barrier instead of across an ordinary function call.

Now observe the impact on interrupt response times. This is a side effect not covered by the standard's notion of "side effects", but it is part of "any weird side effect" that an asm barrier can have.

Whereas code1_asm has no big impact on interrupt response times (provided the code in the comments is small), code2_asm is disastrous w.r.t. interrupt response times.

Notice that in code1_func, func might contain such asm pairs to implement atomic operations, but moving costly_func across func does *not* affect the interrupt response times in such a disastrous way.

Thus you must be *very* careful w.r.t. optimizing across asm volatile + memory clobber. It is too easy to miss some side effect of *real* code.

And this is just one example showing that moving a const function across an asm is not sane. Setting the floating-point rounding mode might be another, and there are surely many other cases that neither you nor I imagine. Thus always keep in mind that GCC is supposed to be a compiler that generates code for *real* machines with real *weird* hardware or requirements.
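
To spell out the rounding-mode case a little (a sketch only; the asm bodies are placeholders standing in for whatever instructions a target uses to change the FPU control register): if the division below were treated as a pure function of its operands and moved across the asms, the program would silently compute with the wrong rounding mode.

extern double Y;

void round_up_divide (double x)
{
     asm volatile ("# set FPU rounding mode to round-towards-+inf" ::: "memory");
     Y = x / 3.0;     /* result depends on the current rounding mode */
     asm volatile ("# restore default FPU rounding mode" ::: "memory");
}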

Optimizing code into scrap and then pointing to some GCC-internal reasoning or some standard's wording does not help with real code.

Johann

One of the big grey areas is what should happen for floating-point ops
that depend on the current rounding mode.  That isn't really modelled
properly yet though.  Again, it affects calls as well as volatile asms.

Thanks,
Richard
