Hi,

I have been informed that the following check-in may cause a slight performance regression on aarch64 and arm, on trunk and on the gcc-4_9 branch:
see https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
This can be seen with the following test example:

test.c:

typedef struct
{
  int *x;
  int *y;
  int *z;
  int *m;
  int *n;
  int *o;
} A;

typedef struct
{
  int x;
  int y;
  int z;
  int m;
  int n;
  int o;
} B;

A a[6];
B *b;

void test (int i)
{
  A *a1 = &a[i];
  B *b1 = &b[i];
  a1->x = &b1->x;
  a1->y = &b1->y;
  a1->z = &b1->z;
  a1->m = &b1->m;
  a1->n = &b1->n;
  a1->o = &b1->o;
}

When compiled with aarch64-elf-gcc -O3 -S we get the following assembler code:

test:
	adrp	x1, b
	sxtw	x3, w0
	mov	w5, 24
	adrp	x2, a
	add	x2, x2, :lo12:a
	ldr	x4, [x1, #:lo12:b]
	lsl	x1, x3, 2
	sub	x1, x1, x3
	lsl	x1, x1, 4
	smaddl	x0, w0, w5, x4
	add	x3, x2, x1
	add	x7, x0, 8
	add	x6, x0, 16
	add	x8, x0, 4
	add	x5, x0, 12
	add	x4, x0, 20
	str	x0, [x2, x1]
	mov	x0, x3
	mov	x1, x3
	str	x8, [x3, 8]
	str	x7, [x0, 16]!
	str	x6, [x1, 32]!
	str	x5, [x0, 8]
	str	x4, [x1, 8]
	ret

Note the two mov instructions, and that two extra registers are used to store the values. The mov instructions were not there before the patch.
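As an aside, here is a by-hand C rendering of the address arithmetic involved (my own naming, purely illustrative; it uses the declarations from test.c above and assumes an LP64 target, so that sizeof (A) == 48 and sizeof (B) == 24, matching the lsl/sub/lsl sequence and the smaddl by 24):

/* Illustrative only: what test() computes, written out by hand.
   Uses a, b, A, B from test.c above; all other names are mine.  */
void test_by_hand (int i)
{
  long idx = (long) i;                      /* sxtw: sign-extend w0       */
  char *b1 = (char *) b + idx * 24;         /* smaddl: b + i * sizeof (B) */
  char *a1 = (char *) a + idx * 48;         /* lsl/sub/lsl: ((i*4)-i)*16  */

  *(int **) (a1 +  0) = (int *) (b1 +  0);  /* a1->x = &b1->x */
  *(int **) (a1 +  8) = (int *) (b1 +  4);  /* a1->y = &b1->y */
  *(int **) (a1 + 16) = (int *) (b1 +  8);  /* a1->z = &b1->z */
  *(int **) (a1 + 24) = (int *) (b1 + 12);  /* a1->m = &b1->m */
  *(int **) (a1 + 32) = (int *) (b1 + 16);  /* a1->n = &b1->n */
  *(int **) (a1 + 40) = (int *) (b1 + 20);  /* a1->o = &b1->o */
}

With a single, fully folded base register, each store needs only one of these small constant offsets; the mov instructions above show that this did not happen here.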
After some investigation I found out how this happened. The difference is first visible with -fdump-rtl-expand-slim:

 1: NOTE_INSN_DELETED
 4: NOTE_INSN_BASIC_BLOCK 2
 2: r83:SI=x0:SI
 3: NOTE_INSN_FUNCTION_BEG
 6: r74:DI=sign_extend(r83:SI)
 7: r85:DI=high(`b')
 8: r84:DI=r85:DI+low(`b')
      REG_EQUAL `b'
 9: r87:SI=0x18
10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
11: r88:DI=[r84:DI]
12: r76:DI=r88:DI+r86:DI
13: r90:DI=high(`a')
14: r89:DI=r90:DI+low(`a')
      REG_EQUAL `a'
15: r91:DI=sign_extend(r83:SI)
16: r92:DI=r91:DI
17: r93:DI=r92:DI<<0x2
18: r94:DI=r93:DI-r91:DI
      REG_EQUAL r91:DI*0x3
19: r95:DI=r94:DI<<0x4
20: r94:DI=r95:DI
      REG_EQUAL r91:DI*0x30
21: r96:DI=r89:DI+r94:DI
22: [r96:DI]=r76:DI
23: r98:DI=high(`a')
24: r97:DI=r98:DI+low(`a')
      REG_EQUAL `a'
25: r99:DI=sign_extend(r83:SI)
26: r100:DI=r99:DI
27: r101:DI=r100:DI<<0x2
28: r102:DI=r101:DI-r99:DI
      REG_EQUAL r99:DI*0x3
29: r103:DI=r102:DI<<0x4
30: r102:DI=r103:DI
      REG_EQUAL r99:DI*0x30
31: r104:DI=r97:DI+r102:DI
32: r105:DI=r76:DI+0x4
33: [r104:DI+0x8]=r105:DI
34: r107:DI=high(`a')
35: r106:DI=r107:DI+low(`a')
      REG_EQUAL `a'
36: r108:DI=sign_extend(r83:SI)
37: r109:DI=r108:DI
38: r110:DI=r109:DI<<0x2
39: r111:DI=r110:DI-r108:DI
      REG_EQUAL r108:DI*0x3
40: r112:DI=r111:DI<<0x4
41: r111:DI=r112:DI
      REG_EQUAL r108:DI*0x30
42: r113:DI=r106:DI+r111:DI
43: r114:DI=r113:DI+0x10
44: r115:DI=r76:DI+0x8
45: [r114:DI]=r115:DI
46: r117:DI=high(`a')
47: r116:DI=r117:DI+low(`a')
      REG_EQUAL `a'
48: r118:DI=sign_extend(r83:SI)
49: r119:DI=r118:DI
50: r120:DI=r119:DI<<0x2
51: r121:DI=r120:DI-r118:DI
      REG_EQUAL r118:DI*0x3
52: r122:DI=r121:DI<<0x4
53: r121:DI=r122:DI
      REG_EQUAL r118:DI*0x30
54: r123:DI=r116:DI+r121:DI
55: r124:DI=r123:DI+0x10
56: r125:DI=r76:DI+0xc
57: [r124:DI+0x8]=r125:DI
58: r127:DI=high(`a')
59: r126:DI=r127:DI+low(`a')
      REG_EQUAL `a'
60: r128:DI=sign_extend(r83:SI)
61: r129:DI=r128:DI
62: r130:DI=r129:DI<<0x2
63: r131:DI=r130:DI-r128:DI
      REG_EQUAL r128:DI*0x3
64: r132:DI=r131:DI<<0x4
65: r131:DI=r132:DI
      REG_EQUAL r128:DI*0x30
66: r133:DI=r126:DI+r131:DI
67: r134:DI=r133:DI+0x20
68: r135:DI=r76:DI+0x10
69: [r134:DI]=r135:DI
70: r137:DI=high(`a')
71: r136:DI=r137:DI+low(`a')
      REG_EQUAL `a'
72: r138:DI=sign_extend(r83:SI)
73: r139:DI=r138:DI
74: r140:DI=r139:DI<<0x2
75: r141:DI=r140:DI-r138:DI
      REG_EQUAL r138:DI*0x3
76: r142:DI=r141:DI<<0x4
77: r141:DI=r142:DI
      REG_EQUAL r138:DI*0x30
78: r143:DI=r136:DI+r141:DI
79: r144:DI=r143:DI+0x20
80: r145:DI=r76:DI+0x14
81: [r144:DI+0x8]=r145:DI

Note the offset +0x8 on insns 33, 57 and 81, but not on insns 22, 45 and 69. This artefact was not there before the patch.

Well, this causes little ripples in the later RTL passes. There is one pass that does a pretty good job at compensating for this effect: forward propagation. In simple cases the resulting code does not look any different, because fwprop folds all the offsets together and the result looks as before. In this example, however, fwprop could not remove some "dead" statements, and that made the auto-inc-dec pass insert the mov statements.

So, what caused this? You probably remember this statement in expr.c, which we did not fully understand but did not want to remove, because it could have subtle effects on code quality:

      /* The check for a constant address in TO_RTX not having VOIDmode
         is probably no longer necessary.  */
      if (MEM_P (to_rtx)
          && GET_MODE (to_rtx) == BLKmode
          && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
          && bitsize > 0
          && (bitpos % bitsize) == 0
          && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
          && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
        {
          to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
          bitregion_start = 0;
          if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
            bitregion_end -= bitpos;
          bitpos = 0;
        }

Now I see that my patch influenced the trigger condition of this statement:

       if (MEM_P (to_rtx))
         {
-          if (volatilep && flag_strict_volatile_bitfields > 0)
+          if (mode1 != VOIDmode)
             to_rtx = adjust_address (to_rtx, mode1, 0);
           else if (GET_MODE (to_rtx) == VOIDmode)

As it turns out, this made GET_MODE (to_rtx) unequal to BLKmode in most cases, so the block above no longer triggers. In this example, get_inner_reference returns offset != NULL, and bitpos is either 0 or 64: it is 64 for every second structure member. When the adjust_address in that block is executed, the bitpos is folded into the to_rtx, and the generated RTL is much more uniform.

So the condition of the if-statement has to be rewritten, to fold the bitpos into the to_rtx, together with the offset, whenever store_field can be expected to emit a single move instruction; that primarily means the alignment is OK. It should not depend on volatileness, and a larger alignment should also be no reason to bail out here. Sorry for this lengthy explanation!
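To make the direction concrete, here is a minimal sketch of the shape I have in mind. This is illustrative only, not the attached patch; the block sits inside the offset-handling path of expand_assignment, which is what "an offset is used at the same time" in the ChangeLog below refers to:

      /* Sketch only, not the attached patch: fold the bitpos into
         TO_RTX whenever the access is aligned well enough for
         store_field to emit a single move, independent of volatileness
         and without bailing out on over-aligned memory.  */
      if (MEM_P (to_rtx)
          && bitsize > 0
          && (bitpos % bitsize) == 0
          && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
          && MEM_ALIGN (to_rtx) >= GET_MODE_ALIGNMENT (mode1))
        {
          to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
          bitregion_start = 0;
          if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
            bitregion_end -= bitpos;
          bitpos = 0;
        }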
So I'd like to rewrite the condition here to something reasonable; see the attached patch. With this patch applied the resulting code is much better again:

test:
	adrp	x1, b
	sxtw	x2, w0
	adrp	x3, a
	mov	w5, 24
	add	x3, x3, :lo12:a
	ldr	x4, [x1, #:lo12:b]
	lsl	x1, x2, 2
	sub	x1, x1, x2
	lsl	x1, x1, 4
	add	x2, x3, x1
	smaddl	x0, w0, w5, x4
	str	x0, [x3, x1]
	add	x8, x0, 4
	add	x7, x0, 8
	add	x6, x0, 12
	add	x5, x0, 16
	add	x4, x0, 20
	str	x8, [x2, 8]
	str	x7, [x2, 16]
	str	x6, [x2, 24]
	str	x5, [x2, 32]
	str	x4, [x2, 40]
	ret

Bootstrapped and regression-tested on x86_64 and arm-linux-gnueabihf.

Ok for trunk?

PS: I wonder if this patch should also go into the 4.9 branch, after a while of course. What do you think?

Thanks
Bernd.

patch-expr.diff
2014-05-27  Bernd Edlinger  <bernd.edlin...@hotmail.de>

	* expr.c (expand_assignment): Fold the bitpos in the to_rtx if
	sufficiently aligned and an offset is used at the same time.
	(expand_expr_real_1): Likewise.