https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340
        Bug ID: 115340
        Summary: Loop/SLP vectorization possible inefficiency
        Product: gcc
        Version: 15.0
        Status: UNCONFIRMED
        Severity: normal
        Priority: P3
        Component: tree-optimization
        Assignee: unassigned at gcc dot gnu.org
        Reporter: rdapp at gcc dot gnu.org
        CC: rguenth at gcc dot gnu.org
        Target Milestone: ---
        Target: riscv

The following loop is heavily adjusted from x264 satd's last loop to showcase a
particular problem.  I'm not sure it's still representative of x264 anymore,
but well...  I used riscv as the target; avx512 and aarch64 look similar.

  typedef unsigned int uint32_t;

  void foo (uint32_t **restrict tmp, uint32_t *restrict out)
  {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
      {
        out[i + 0]  = tmp[0][i] + 1;
        out[i + 4]  = tmp[1][i] + 1;
        out[i + 8]  = tmp[2][i] + 1;
        out[i + 12] = tmp[3][i] + 1;
      }
  }

We loop vectorize it as 4x unrolled:

  vect__5.6_70 = MEM <vector(4) unsigned int> [(uint32_t *)vectp.4_72];
  vect__7.7_68 = vect__5.6_70 + { 1, 1, 1, 1 };
  MEM <vector(4) unsigned int> [(uint32_t *)vectp_out.8_67] = vect__7.7_68;
  ...

On riscv:

  vsetivli        zero,4,e32,mf2,ta,ma
  vle32.v v1,0(a4)
  vadd.vi v1,v1,1
  vse32.v v1,0(a1)
  addi    a4,a1,16
  vle32.v v1,0(a2)
  vadd.vi v1,v1,1
  vse32.v v1,0(a4)
  addi    a4,a1,32
  vle32.v v1,0(a3)
  vadd.vi v1,v1,1
  vse32.v v1,0(a4)
  addi    a1,a1,48
  vle32.v v1,0(a5)
  vadd.vi v1,v1,1
  vse32.v v1,0(a1)

That's not ideal when a 64-byte vector can hold all values simultaneously (and
permuting four 16-byte vectors into one is acceptable).  Of course the exact
sequence will depend on costs, but on riscv I would expect code along those
lines (similar to what clang produces):

  vsetivli        zero, 4, e32, mf2, ta, ma
  vle32.v v8, (a2)
  vle32.v v9, (a4)
  vle32.v v10, (a0)
  vle32.v v11, (a3)
  vsetivli        zero, 8, e32, mf2, ta, ma
  vslideup.vi     v9, v10, 4
  vslideup.vi     v8, v11, 4
  vsetivli        zero, 16, e32, m1, ta, ma
  vslideup.vi     v8, v9, 8
  vadd.vi v8, v8, 1
  vse32.v v8, (a1)

That is 4 loads, 3 permutes, one add and one store.

When disabling loop vectorization, basic-block SLP gets us halfway there but
still loads each value separately:

  ...
  _17 = MEM[(uint32_tD.2756 *)_15 + 12B clique 1 base 0];
  _24 = MEM[(uint32_tD.2756 *)_22 + 12B clique 1 base 0];
  _69 = {_66, _81, _114, _5, _61, _86, _119, _10, _54, _93, _126, _17,
         _47, _100, _133, _24};
  vect__64.5_68 = _69 + { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
  MEM <vector(16) unsigned intD.4> [(uint32_tD.2756 *)out_33(D) clique 1 base 1]
    = vect__64.5_68;

I suppose that's the way it is supposed to work right now, but I'd rather have
the grouped loads done with vector loads and permuted into place (rather than
scalar loads, with each element "permuted" into position individually)?

Now, with SLPification in progress, I suppose that goal is a bit more realistic
than before.  Richi, is there something that can be done for now to get us
closer?  I'm currently looking into your rguenth/load-perm branch; is that a
reasonable way to start?
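
For illustration only (not part of the testcase), here is a rough hand-written
sketch of the desired shape using GNU vector extensions: four vector loads,
three concatenating permutes, one add and one wide store.  The function name
foo_wide is made up, __builtin_shufflevector requires GCC 12 or newer, and the
memcpy calls merely stand in for possibly unaligned vector loads/stores.

  typedef unsigned int uint32_t;
  typedef uint32_t v4u32  __attribute__ ((vector_size (16)));
  typedef uint32_t v8u32  __attribute__ ((vector_size (32)));
  typedef uint32_t v16u32 __attribute__ ((vector_size (64)));

  void foo_wide (uint32_t **restrict tmp, uint32_t *restrict out)
  {
    v4u32 a, b, c, d;
    /* Four 16-byte vector loads.  */
    __builtin_memcpy (&a, tmp[0], sizeof a);
    __builtin_memcpy (&b, tmp[1], sizeof b);
    __builtin_memcpy (&c, tmp[2], sizeof c);
    __builtin_memcpy (&d, tmp[3], sizeof d);
    /* Three concatenating permutes building one 16-element vector.  */
    v8u32 ab = __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7);
    v8u32 cd = __builtin_shufflevector (c, d, 0, 1, 2, 3, 4, 5, 6, 7);
    v16u32 all = __builtin_shufflevector (ab, cd, 0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
    /* One add and one 64-byte store.  */
    all = all + 1;
    __builtin_memcpy (out, &all, sizeof all);
  }

Whether such a wide formulation wins in practice of course depends on the
target's permute cost model, as noted above.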