https://gcc.gnu.org/bugzilla/show_bug.cgi?id=73350

            Bug ID: 73350
           Summary: AVX512: GCC optimizes away rounding flags
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The AVX512 instruction set introduced the ability to specify a rounding flag
for almost every arithmetic operation that is subject to rounding. This is
extremely useful because it eliminates the need to mess around with the MXCSR
control register when using tools like interval arithmetic that need control of
rounding.

Unfortunately, support for this is currently broken in GCC. Specifically, the
GCC optimizer does not seem to distinguish between function variants with
different rounding modes and ends up merging them during common subexpression
elimination.

Consider the simple program attached below, which computes "1 + pi" with +inf
and -inf rounding modes and then prints the difference of these values. The
expected output is:

$ g++ test.c -o test -mavx512f -O0 -fomit-frame-pointer && ./test
-4.76837e-07

At optimization level -O1, this currently stops working (tested with GCC
trunk):

$ g++ test.c -o test -mavx512f -O1 -fomit-frame-pointer && ./test
0

Looking at the assembly, there are two surprising things: first, common
subexpression elimination seems to have (partially) merged the two additions.
The second add is still generated but its result is never used.

The other weird thing is that GCC decides to fill a mask register with '-1' and
then use the masked versions of these operations instead of using the unmasked
versions, which use a "-1" mask by default.

_main:
        leaq    8(%rsp), %r10
        andq    $-64, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r10
        subq    $40, %rsp
        movl    $-1, %eax
        kmovw   %eax, %k1
        vbroadcastss    LC0(%rip), %zmm1
        vbroadcastss    LC1(%rip), %zmm2
        vaddps  {rd-sae}, %zmm2, %zmm1, %zmm0{%k1}{z} <------ Why use mask?
        vaddps  {ru-sae}, %zmm2, %zmm1, %zmm1{%k1}{z}
        vsubss  %xmm0, %xmm0, %xmm0                   <------ xmm0 ??????
        vcvtss2sd       %xmm0, %xmm0, %xmm0
        leaq    LC2(%rip), %rdi
        movl    $1, %eax
        call    _printf
        movl    $0, %eax
        addq    $40, %rsp
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret

// ============== Program to reproduce ============

#include <stdio.h>
#include <math.h>
#include <immintrin.h>

int main(int argc, char *argv[]) {
    __m512 a = _mm512_set1_ps((float) M_PI);
    __m512 b = _mm512_set1_ps(1.f);

    /* Same addition twice; only the embedded rounding mode differs. */
    __m512 result1 = _mm512_add_round_ps(
        a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    __m512 result2 = _mm512_add_round_ps(
        a, b, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);

    printf("%g\n", result1[0] - result2[0]);

    return 0;
}
