https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151
Bug ID: 114151
Summary: [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*

Created attachment 57559
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57559&action=edit
testcase

The attached C++ testcase compiled with -O3 -mcpu=neoverse-n2 used to produce a nice, simple loop. After g:a0b1798042d033fd2cc2c806afbb77875dd2909b the codegen is strange and uses poor addressing modes.

The first odd part is that the loop has been split: the "main" loop is followed by a guard that branches to the exit if the iteration count is 1. If not, instead of simply looping again, it falls through to a copy of the main loop, but one with degraded addressing modes. The copy of the loop seems to have unshared the address calculations.

Before we had:

  _128 = (void *) ivtmp.11_20;
  _54 = MEM <__SVFloat16_t> [(__fp16 *)_128];
  _10 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [16B, 16B]];
  _75 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [32B, 32B]];

etc., i.e. everything as an offset from _128. Now we have:

  col_i_61 = (int) ivtmp.11_100;
  _60 = (long unsigned int) col_i_61;
  _59 = _60 * 2;
  _58 = a_j_69 + _59;
  _54 = MEM <__SVFloat16_t> [(__fp16 *)_58];
  _53 = _59 + POLY_INT_CST [16, 16];
  _13 = a_j_69 + _53;
  _10 = MEM <__SVFloat16_t> [(__fp16 *)_13];
  _74 = _59 + POLY_INT_CST [32, 32];
  _19 = a_j_69 + _74;
  _75 = MEM <__SVFloat16_t> [(__fp16 *)_19];

and similarly for the stores. It also creates some oddly complicated addressing computations.
Before we had:

  _144 = p_mat_16(D) + 6;
  _64 = MEM <__SVFloat16_t> [(__fp16 *)_144 + ivtmp.10_100 * 2];
  _143 = p_mat_16(D) + 4;
  _84 = MEM <__SVFloat16_t> [(__fp16 *)_143 + ivtmp.10_100 * 2];

and after:

  ivtmp.23_130 = (unsigned long) p_mat_16(D);
  _123 = 2 - ivtmp.23_130;
  _124 = &MEM <__SVFloat16_t> [(__fp16 *)0B + _123 + ivtmp.12_109 * 2];
  _64 = MEM <__SVFloat16_t> [(__fp16 *)_124];
  _122 = -ivtmp.23_130;
  _120 = &MEM <__SVFloat16_t> [(__fp16 *)0B + _122 + ivtmp.12_109 * 2];
  _84 = MEM <__SVFloat16_t> [(__fp16 *)_120];

This results in a significant code-size increase and a 7-10% performance loss.