https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com

--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with -Ofast and -mprefer-vector-width=256 I get

  <bb 2> [local count: 63136019]:
  vect__4.2_3 = MEM[(float *)a_11(D)];
  vect__5.3_4 = RSQRT (vect__4.2_3);
  MEM[(float *)r_12(D)] = vect__5.3_4;
  vect__4.2_21 = MEM[(float *)a_11(D) + 32B];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D) + 32B] = vect__5.3_20;

while with -mprefer-vector-width=512 I need -mavx512er to trigger the
expander, then I also get

  <bb 2> [local count: 63136020]:
  vect__4.2_21 = MEM[(float *)a_11(D)];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D)] = vect__5.3_20;

and in that case

rsqrt:
.LFB12:
        .cfi_startproc
        vrsqrt28ps      (%rsi), %zmm0
        vmovups %zmm0, (%rdi)
        vzeroupper
        ret

(huh? isn't there a NR step missing?)

for -mprefer-vector-width=256 I get (irrespective of -mavx512er):

rsqrt:
.LFB12:
        .cfi_startproc
        vmovups (%rsi), %ymm1
        vmovaps .LC1(%rip), %ymm3
        vrsqrtps        %ymm1, %ymm2
        vmovaps .LC0(%rip), %ymm4
        vmovups 32(%rsi), %ymm0
        vmulps  %ymm1, %ymm2, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmulps  %ymm3, %ymm2, %ymm2
        vaddps  %ymm4, %ymm1, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmovups %ymm1, (%rdi)
        vrsqrtps        %ymm0, %ymm1
        vmulps  %ymm0, %ymm1, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmulps  %ymm3, %ymm1, %ymm1
        vaddps  %ymm4, %ymm0, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmovups %ymm0, 32(%rdi)
        vzeroupper

so the issue lies somewhere in the backend. Of the "fast" you need
-ffinite-math-only -fno-math-errno -funsafe-math-optimizations.

GCC definitely fails to see the FMA use as opportunity in
ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
expander w/o avx512er where we could still use the NR sequence
with the other instruction. HJ?
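
For readers following the ymm sequence above, the arithmetic can be written out in scalar C. This is only a sketch under stated assumptions, not GCC's code: the values of .LC0 and .LC1 are not shown in the dump and are assumed here to be -3.0f and -0.5f, which makes the sequence one Newton-Raphson (NR) refinement of the estimate. The hardware estimate (vrsqrtps) is replaced by the well-known bit-level approximation so that the example does not depend on x86 intrinsics. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the vrsqrtps estimate: the classic bit-level
   approximation of 1/sqrt(a). This is NOT what GCC emits; it only
   supplies a low-precision starting value for the NR step. */
float rsqrt_estimate(float a)
{
    uint32_t bits;
    float x0;

    memcpy(&bits, &a, sizeof bits);      /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* rough 1/sqrt in the bit domain */
    memcpy(&x0, &bits, sizeof x0);
    return x0;
}

/* One NR step in the same operation order as the ymm sequence:
       result = (a * x0 * x0 + C0) * (x0 * C1)
   with C0 = -3.0f and C1 = -0.5f (assumed values for .LC0 and .LC1). */
float rsqrt_nr(float a)
{
    const float c0 = -3.0f;              /* assumed contents of .LC0 */
    const float c1 = -0.5f;              /* assumed contents of .LC1 */
    float x0, t, h;

    /* a == 0 would give 0 * inf = NaN in the products below. */
    assert(a > 0.0f);

    x0 = rsqrt_estimate(a);
    t = a * x0;                          /* vmulps: a * x0 */
    t = t * x0;                          /* vmulps: a * x0 * x0 */
    h = x0 * c1;                         /* vmulps: x0 * C1 */
    t = t + c0;                          /* vaddps: a * x0 * x0 + C0 */
    return t * h;                        /* vmulps: final product */
}
```

Written this way, the multiply by x0 followed by the add of C0 is the pair that could in principle be contracted into a single fused multiply-add, which is the FMA opportunity mentioned above.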