https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |mkuvyrkov at gcc dot gnu.org

--- Comment #9 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
I've looked into another case where the inability to handle stores with gaps
generates sub-optimal code. I'm interested in spending some time on fixing
this, given some guidance on the vectorizer. Is handling stores with gaps
substantially more difficult than handling loads with gaps?

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
which is the #2 function in the 462.libquantum profile. This loop accounts
for about 25% of total 462.libquantum time.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into:

===
  <bb 5>:
  # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[(long long unsigned int *)ivtmp_50] = _51;
===

which then becomes, for aarch64:

===
.L4:
	ld2	{v0.2d - v1.2d}, [x1]
	add	w2, w2, 1
	cmp	w2, w7
	eor	v0.16b, v2.16b, v0.16b
	umov	x4, v0.d[1]
	st1	{v0.d}[0], [x1]
	add	x1, x1, 32
	str	x4, [x1, -16]
	bcc	.L4
===