https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #16)
> On an unrelated note, why does the loop overhead use so many instructions?
> Assuming that I am as misguided as gcc about load-op combining, I would
> write it as:
>         sub          %rax, %rdx
> .L3:
>         vmovupd      (%rdx,%rax), %ymm1
>         vmovupd      32(%rdx,%rax), %ymm0
>         vfmadd213pd  32(%rax), %ymm3, %ymm1
>         vfnmadd213pd (%rax), %ymm2, %ymm0
>         vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
>         vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
>         vmovupd      %ymm0, (%rax)
>         vmovupd      %ymm1, 32(%rax)
>         addq         $64, %rax
>         decl         %esi
>         jb           .L3
>
> The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
> vs. 5 x86 instructions == 4 macro-ops in the gcc variant.
> Also, in the gcc variant all memory accesses have a displacement, which
> makes them 1 byte longer. In my variant only half of the accesses have a
> displacement.
>
> I think in the past I have seen cases where gcc generates optimal or
> near-optimal code sequences for loop overhead. I wonder why it cannot do
> it here.

I don't think we currently consider IVs based on the difference of two
addresses.  The cost benefit of no displacement is only size; otherwise I
have no idea why we have biased the %rax accesses by -32.

Why we fail to consider decrement-to-zero for the counter IV is probably
because IVCANON would add such an IV, but the vectorizer replaces it and
IVOPTs doesn't consider re-adding it.