https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69653
Bug ID: 69653
Summary: More CSE opportunities in address expressions
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amker at gcc dot gnu.org
Target Milestone: ---

Given the simple test below:

#define MAX (1024)

void bar1 (int *, int *, int *, int *);
void bar2 (int *, int *, int *, int *);

int
foo (unsigned int n)
{
  unsigned int i;
  int a0[MAX], a1[MAX], a2[MAX], a3[MAX];

  bar1 (a0, a1, a2, a3);
  for (i = 0; i < n; i++)
    {
      unsigned char j = (unsigned char) i;
      a0[j] = a0[j] + 1;
      a1[j + 1] = a1[j + 1] + 2;
      a2[j + 2] = a2[j + 2] + 1;
      a3[j + 3] = a3[j + 3] + 3;
    }
  bar2 (a0, a1, a2, a3);
  return 0;
}

the following code is generated on AArch64 with "-O2 -mcpu=cortex-a57":

foo:
	sub	sp, sp, #16384
	mov	x1, 12352
	stp	x29, x30, [sp, -64]!
	add	x29, sp, 0
	stp	x19, x20, [sp, 16]
	add	x20, x29, x1
	mov	x1, 8256
	stp	x21, x22, [sp, 32]
	add	x21, x29, x1
	mov	x1, 4160
	add	x22, x29, x1
	add	x19, x29, 64
	str	x23, [sp, 48]
	mov	x3, x20
	mov	w23, w0
	mov	x2, x21
	mov	x1, x22
	mov	x0, x19
	bl	bar1
	cbz	w23, .L4
	mov	w4, 0
	.p2align 2
.L5:
	and	w0, w4, 255
	add	w4, w4, 1
	add	w3, w0, 1
	add	w2, w0, 2
	add	w1, w0, 3
	sxtw	x3, w3
	cmp	w23, w4
	sxtw	x0, w0
	sxtw	x2, w2
	ldr	w7, [x22, x3, lsl 2]
	sxtw	x1, w1
	ldr	w8, [x19, x0, lsl 2]
	ldr	w6, [x21, x2, lsl 2]
	add	w7, w7, 2
	ldr	w5, [x20, x1, lsl 2]
	add	w8, w8, 1
	str	w7, [x22, x3, lsl 2]
	add	w6, w6, 1
	str	w8, [x19, x0, lsl 2]
	add	w5, w5, 3
	str	w6, [x21, x2, lsl 2]
	str	w5, [x20, x1, lsl 2]
	bne	.L5
.L4:
	mov	x3, x20
	mov	x2, x21
	mov	x1, x22
	mov	x0, x19
	bl	bar2
	ldp	x19, x20, [sp, 16]
	mov	w0, 0
	ldp	x21, x22, [sp, 32]
	ldr	x23, [sp, 48]
	ldp	x29, x30, [sp], 64
	add	sp, sp, 16384
	ret

The loop is not optimal because CSE opportunities among the address expressions are missed. Every address expression in this case has the form "base + (i << scale) + const". The scaling part "i << scale" is a common subexpression, while the "base + const" part is (stack-frame-pointer-related) loop invariant.
In my understanding, tree passes like ivopts/slsr should be able to handle CSE in addresses, and let the RTL optimizers decide whether the loop-invariant part can be embedded in the addressing mode or hoisted out of the loop. In this case the index variable isn't a SCEV, so ivopts can't help. SLSR is designed to do strength reduction on address expressions; maybe it should be improved to deal with addresses that have different bases. Also, lowering all memory references to MEM_REF/TARGET_MEM_REF before SLSR could make SLSR's job easier. In the end, the following pseudo code is wanted for the loop:

Loop preheader:
  r0 = base0 + const0;
  r1 = base1 + const1;
  r2 = base2 + const2;
  r3 = base3 + const3;
Loop body:
  x = i << scale;
  ldr [r0 + x]
  ldr [r1 + x]
  ldr [r2 + x]
  ldr [r3 + x]
  //...
  str [r0 + x]
  str [r1 + x]
  str [r2 + x]
  str [r3 + x]