https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #28 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast
-S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o
rsqrt.s

I attached a minimum working example, demonstrating the problem of excessive
code generation for reciprocal square root, in the file rsqrt.c.
You can compile with:

gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt: https://godbolt.org/z/c7Z0En

For gcc:

        vmovups (%rsi), %zmm0
        vxorps  %xmm1, %xmm1, %xmm1
        vcmpps  $4, %zmm0, %zmm1, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  .LC1(%rip), %zmm2, %zmm2
        vaddps  .LC0(%rip), %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmulps  %zmm0, %zmm1, %zmm0
        vaddps  %zmm1, %zmm1, %zmm1
        vsubps  %zmm0, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)

For Clang:

        vmovups (%rsi), %zmm0
        vrsqrt14ps      %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm0
        vfmadd213ps     .LCPI0_0(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * zmm0) + mem
        vmulps  .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)

Clang looks like it is doing

 /*     rsqrt(a) = -0.5     * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5.
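gcc emits considerably more code and takes a fairly different route: it masks
the vrsqrt14ps estimate for zero inputs, and after refining it appears to take
a second estimate with vrcp14ps and refine that as well.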
