https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #35 from Chris Elrod <elrodc at gmail dot com> ---
> rsqrt:
> .LFB12:
>         .cfi_startproc
>         vrsqrt28ps      (%rsi), %zmm0
>         vmovups %zmm0, (%rdi)
>         vzeroupper
>         ret
>
> (huh?  isn't there a NR step missing?)
>

I assume because vrsqrt28ps is much more accurate than vrsqrt14ps, it wasn't
considered necessary. Unfortunately, -march=skylake-avx512 does not have
-mavx512er, and therefore should use the less accurate vrsqrt14ps + NR step.

I think vrsqrt14pd/s are -mavx512f or -mavx512vl.

> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without
> that I don't know how the machinery can guess how to use rsqrt (there are
> probably ways).

Looking at the asm from only r[i] = sqrtf(a[i]):

        vmovups (%rsi), %zmm1
        vxorps  %xmm0, %xmm0, %xmm0
        vcmpps  $4, %zmm1, %zmm0, %k1
        vrsqrt14ps      %zmm1, %zmm0{%k1}{z}
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmulps  .LC1(%rip), %zmm1, %zmm1
        vaddps  .LC0(%rip), %zmm0, %zmm0
        vmulps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

vs the asm from only r[i] = 1 / a[i]:

        vmovups (%rsi), %zmm1
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

it looks like the expander is there for sqrt, and for inverse, and we're just
getting both one after the other.

So it does look like I could benchmark which one is slower than the regular
instruction on my platform, if that would be useful.