https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #22 from Chris Elrod <elrodc at gmail dot com> --- Okay. I did that, and the time went from about 4.25 microseconds down to 4.0 microseconds. So that is an improvement, but accounts for only a small part of the difference with the LLVM-compilers. -O3 -fno-math-errno was about 3.5 microseconds, so -funsafe-math-optimizations still results in a regression in this code. 3.5 microseconds is roughly as fast as you can get with vsqrt and div. My best guess now is that gcc does a lot more to improve the accuracy of vsqrt. If I understand correctly, these are all the involved instructions: vmovaps .LC2(%rip), %zmm7 vmovaps .LC3(%rip), %zmm6 # for loop begins vrsqrt14ps %zmm1, %zmm2 # comparison and mask removed vmulps %zmm1, %zmm2, %zmm0 vmulps %zmm2, %zmm0, %zmm1 vmulps %zmm6, %zmm0, %zmm0 vaddps %zmm7, %zmm1, %zmm1 vmulps %zmm0, %zmm1, %zmm1 vrcp14ps %zmm1, %zmm0 vmulps %zmm1, %zmm0, %zmm1 vmulps %zmm1, %zmm0, %zmm1 vaddps %zmm0, %zmm0, %zmm0 vsubps %zmm1, %zmm0, %zmm0 vfnmadd213ps (%r10,%rax), %zmm0, %zmm2 If I understand this correctly: zmm2 =(approx) 1 / sqrt(zmm1) zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1) zmm1 = zmm0 * zmm2 = (approx) 1 zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1) zmm1 = zmm7 * zmm1 = (approx) constant7 zmm1 = zmm0 * zmm1 = (approx) constant6 * constant6 * sqrt(zmm1) zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7) zmm1 = zmm1 * zmm0 = (approx) 1 zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7) zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 2 / (constant6 * constant7) zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7) which implies that constant6 * constant6 = approximately -1? LLVM seems to do a much simpler / briefer update of the output of vrsqrt. When I implemented a vrsqrt intrinsic in a Julia library, I just looked at Wikipedia and did (roughly): constant1 = -0.5 constant2 = 1.5 zmm2 = (approx) 1 / sqrt(zmm1) zmm3 = constant * zmm1 zmm1 = zmm2 * zmm2 zmm3 = zmm3 * zmm1 + constant2 zmm2 = zmm2 * zmm3 I am not a numerical analyst, so I can't comment on relative validities or accuracies of these approaches. I also don't know what LLVM 7+ does. LLVM 6 doesn't use vrsqrt. I would be interesting in reading explanations or discussions, if any are available.