Hi,

I have been informed that the following check-in may cause a slight performance regression on aarch64 and arm, on trunk and on the gcc-4_9 branch:
see https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
This can be seen with the following test example:

test.c:

typedef struct
{
  int *x;
  int *y;
  int *z;
  int *m;
  int *n;
  int *o;
} A;

typedef struct
{
  int x;
  int y;
  int z;
  int m;
  int n;
  int o;
} B;

A a[6];
B *b;

void test (int i)
{
  A *a1 = &a[i];
  B *b1 = &b[i];
  a1->x = &b1->x;
  a1->y = &b1->y;
  a1->z = &b1->z;
  a1->m = &b1->m;
  a1->n = &b1->n;
  a1->o = &b1->o;
}

When compiled with aarch64-elf-gcc -O3 -S we get the following assembler code:

test:
	adrp	x1, b
	sxtw	x3, w0
	mov	w5, 24
	adrp	x2, a
	add	x2, x2, :lo12:a
	ldr	x4, [x1, #:lo12:b]
	lsl	x1, x3, 2
	sub	x1, x1, x3
	lsl	x1, x1, 4
	smaddl	x0, w0, w5, x4
	add	x3, x2, x1
	add	x7, x0, 8
	add	x6, x0, 16
	add	x8, x0, 4
	add	x5, x0, 12
	add	x4, x0, 20
	str	x0, [x2, x1]
	mov	x0, x3
	mov	x1, x3
	str	x8, [x3, 8]
	str	x7, [x0, 16]!
	str	x6, [x1, 32]!
	str	x5, [x0, 8]
	str	x4, [x1, 8]
	ret

Note the two mov instructions, and that two extra registers are used to store the values. The mov instructions were not there before the patch.
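As an aside, here is a by-hand C rendering of the address arithmetic involved (my own naming, purely illustrative; it uses the declarations from test.c above and assumes an LP64 target, so that sizeof (A) == 48 and sizeof (B) == 24, matching the lsl/sub/lsl sequence and the smaddl by 24):

/* Illustrative only: what test() computes, written out by hand.
   Uses a, b, A, B from test.c above; all other names are mine.  */
void test_by_hand (int i)
{
  long idx = (long) i;                      /* sxtw: sign-extend w0       */
  char *b1 = (char *) b + idx * 24;         /* smaddl: b + i * sizeof (B) */
  char *a1 = (char *) a + idx * 48;         /* lsl/sub/lsl: ((i*4)-i)*16  */

  *(int **) (a1 +  0) = (int *) (b1 +  0);  /* a1->x = &b1->x */
  *(int **) (a1 +  8) = (int *) (b1 +  4);  /* a1->y = &b1->y */
  *(int **) (a1 + 16) = (int *) (b1 +  8);  /* a1->z = &b1->z */
  *(int **) (a1 + 24) = (int *) (b1 + 12);  /* a1->m = &b1->m */
  *(int **) (a1 + 32) = (int *) (b1 + 16);  /* a1->n = &b1->n */
  *(int **) (a1 + 40) = (int *) (b1 + 20);  /* a1->o = &b1->o */
}

With a single, fully folded base register, each store needs only one of these small constant offsets; the mov instructions above show that this did not happen here.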
After some investigation I found out how this happened. The difference is first visible with -fdump-rtl-expand-slim:

 1: NOTE_INSN_DELETED
 4: NOTE_INSN_BASIC_BLOCK 2
 2: r83:SI=x0:SI
 3: NOTE_INSN_FUNCTION_BEG
 6: r74:DI=sign_extend(r83:SI)
 7: r85:DI=high(`b')
 8: r84:DI=r85:DI+low(`b')
      REG_EQUAL `b'
 9: r87:SI=0x18
10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
11: r88:DI=[r84:DI]
12: r76:DI=r88:DI+r86:DI
13: r90:DI=high(`a')
14: r89:DI=r90:DI+low(`a')
      REG_EQUAL `a'
15: r91:DI=sign_extend(r83:SI)
16: r92:DI=r91:DI
17: r93:DI=r92:DI<<0x2
18: r94:DI=r93:DI-r91:DI
      REG_EQUAL r91:DI*0x3
19: r95:DI=r94:DI<<0x4
20: r94:DI=r95:DI
      REG_EQUAL r91:DI*0x30
21: r96:DI=r89:DI+r94:DI
22: [r96:DI]=r76:DI
23: r98:DI=high(`a')
24: r97:DI=r98:DI+low(`a')
      REG_EQUAL `a'
25: r99:DI=sign_extend(r83:SI)
26: r100:DI=r99:DI
27: r101:DI=r100:DI<<0x2
28: r102:DI=r101:DI-r99:DI
      REG_EQUAL r99:DI*0x3
29: r103:DI=r102:DI<<0x4
30: r102:DI=r103:DI
      REG_EQUAL r99:DI*0x30
31: r104:DI=r97:DI+r102:DI
32: r105:DI=r76:DI+0x4
33: [r104:DI+0x8]=r105:DI
34: r107:DI=high(`a')
35: r106:DI=r107:DI+low(`a')
      REG_EQUAL `a'
36: r108:DI=sign_extend(r83:SI)
37: r109:DI=r108:DI
38: r110:DI=r109:DI<<0x2
39: r111:DI=r110:DI-r108:DI
      REG_EQUAL r108:DI*0x3
40: r112:DI=r111:DI<<0x4
41: r111:DI=r112:DI
      REG_EQUAL r108:DI*0x30
42: r113:DI=r106:DI+r111:DI
43: r114:DI=r113:DI+0x10
44: r115:DI=r76:DI+0x8
45: [r114:DI]=r115:DI
46: r117:DI=high(`a')
47: r116:DI=r117:DI+low(`a')
      REG_EQUAL `a'
48: r118:DI=sign_extend(r83:SI)
49: r119:DI=r118:DI
50: r120:DI=r119:DI<<0x2
51: r121:DI=r120:DI-r118:DI
      REG_EQUAL r118:DI*0x3
52: r122:DI=r121:DI<<0x4
53: r121:DI=r122:DI
      REG_EQUAL r118:DI*0x30
54: r123:DI=r116:DI+r121:DI
55: r124:DI=r123:DI+0x10
56: r125:DI=r76:DI+0xc
57: [r124:DI+0x8]=r125:DI
58: r127:DI=high(`a')
59: r126:DI=r127:DI+low(`a')
      REG_EQUAL `a'
60: r128:DI=sign_extend(r83:SI)
61: r129:DI=r128:DI
62: r130:DI=r129:DI<<0x2
63: r131:DI=r130:DI-r128:DI
      REG_EQUAL r128:DI*0x3
64: r132:DI=r131:DI<<0x4
65: r131:DI=r132:DI
      REG_EQUAL r128:DI*0x30
66: r133:DI=r126:DI+r131:DI
67: r134:DI=r133:DI+0x20
68: r135:DI=r76:DI+0x10
69: [r134:DI]=r135:DI
70: r137:DI=high(`a')
71: r136:DI=r137:DI+low(`a')
      REG_EQUAL `a'
72: r138:DI=sign_extend(r83:SI)
73: r139:DI=r138:DI
74: r140:DI=r139:DI<<0x2
75: r141:DI=r140:DI-r138:DI
      REG_EQUAL r138:DI*0x3
76: r142:DI=r141:DI<<0x4
77: r141:DI=r142:DI
      REG_EQUAL r138:DI*0x30
78: r143:DI=r136:DI+r141:DI
79: r144:DI=r143:DI+0x20
80: r145:DI=r76:DI+0x14
81: [r144:DI+0x8]=r145:DI

Note the offset +0x8 on insns 33, 57 and 81, but not on insns 22, 45 and 69. This artefact was not there before the patch.

Well, this causes little ripples in the later RTL passes. There is one pass that does a pretty good job at compensating for this effect: forward propagation. In simple cases the resulting code does not look any different, because fwprop folds all the offsets together and the result looks as before. In this example, however, fwprop could not remove some "dead" statements, and that made the auto-inc-dec pass insert the mov statements.

So, what caused this? You probably remember this statement in expr.c, which we did not fully understand but did not want to remove, because it could have subtle effects on code quality:

      /* The check for a constant address in TO_RTX not having VOIDmode
         is probably no longer necessary.  */
      if (MEM_P (to_rtx)
          && GET_MODE (to_rtx) == BLKmode
          && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
          && bitsize > 0
          && (bitpos % bitsize) == 0
          && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
          && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
        {
          to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
          bitregion_start = 0;
          if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
            bitregion_end -= bitpos;
          bitpos = 0;
        }

Now I see that my patch influenced the trigger condition of this statement:

       if (MEM_P (to_rtx))
         {
-          if (volatilep && flag_strict_volatile_bitfields > 0)
+          if (mode1 != VOIDmode)
             to_rtx = adjust_address (to_rtx, mode1, 0);
           else if (GET_MODE (to_rtx) == VOIDmode)

As it turns out, this made GET_MODE (to_rtx) unequal to BLKmode in most cases, so the block above no longer triggers. In this example, get_inner_reference returns offset != NULL, and bitpos is either 0 or 64: it is 64 for every second structure member. When the adjust_address in that block is executed, the bitpos is folded into the to_rtx, and the generated RTL is much more uniform.

So the condition of the if-statement has to be rewritten, to fold the bitpos into the to_rtx, together with the offset, whenever store_field can be expected to emit a single move instruction; that primarily means the alignment is OK. It should not depend on volatileness, and a larger alignment should also be no reason to bail out here. Sorry for this lengthy explanation!
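To make the direction concrete, here is a minimal sketch of the shape I have in mind. This is illustrative only, not the attached patch; the block sits inside the offset-handling path of expand_assignment, which is what "an offset is used at the same time" in the ChangeLog below refers to:

      /* Sketch only, not the attached patch: fold the bitpos into
         TO_RTX whenever the access is aligned well enough for
         store_field to emit a single move, independent of volatileness
         and without bailing out on over-aligned memory.  */
      if (MEM_P (to_rtx)
          && bitsize > 0
          && (bitpos % bitsize) == 0
          && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
          && MEM_ALIGN (to_rtx) >= GET_MODE_ALIGNMENT (mode1))
        {
          to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
          bitregion_start = 0;
          if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
            bitregion_end -= bitpos;
          bitpos = 0;
        }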
So I'd like to rewrite the condition here to something reasonable; see the attached patch. With this patch applied the resulting code is much better again:

test:
	adrp	x1, b
	sxtw	x2, w0
	adrp	x3, a
	mov	w5, 24
	add	x3, x3, :lo12:a
	ldr	x4, [x1, #:lo12:b]
	lsl	x1, x2, 2
	sub	x1, x1, x2
	lsl	x1, x1, 4
	add	x2, x3, x1
	smaddl	x0, w0, w5, x4
	str	x0, [x3, x1]
	add	x8, x0, 4
	add	x7, x0, 8
	add	x6, x0, 12
	add	x5, x0, 16
	add	x4, x0, 20
	str	x8, [x2, 8]
	str	x7, [x2, 16]
	str	x6, [x2, 24]
	str	x5, [x2, 32]
	str	x4, [x2, 40]
	ret

Bootstrapped and regression-tested on x86_64 and arm-linux-gnueabihf.

Ok for trunk?

PS: I wonder if this patch should also go into the 4.9 branch, after a while of course. What do you think?

Thanks
Bernd.

patch-expr.diff
2014-05-27  Bernd Edlinger  <bernd.edlin...@hotmail.de>

	* expr.c (expand_assignment): Fold the bitpos in the to_rtx if
	sufficiently aligned and an offset is used at the same time.
	(expand_expr_real_1): Likewise.