https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140

            Bug ID: 98140
           Summary: Reused register by xsmincdp leads to wrong NaN
                    propagation on Power9
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 49679
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49679&action=edit
(preprocessed) source code to reproduce issue

Summary: xsmincdp instructions are generated in a form like `xsmincdp b,a,b`
for code that looks like `(a > b) ? b : a`

I was debugging an issue in PyTorch
(https://github.com/pytorch/pytorch/issues/48591) where I encountered the
following problem:

A clamp function is used which looks like this:

c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);

This is used in very complex code with multiple levels of C++ templates and
lambdas, which combines manually unrolled loops with unroll-friendly loops
(i.e. the above is called in a loop with a fixed trip count of 8).
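
For readers without the attachment, the clamp described above can be sketched
in isolation roughly as follows. This is only a minimal illustration, not the
PyTorch code: the names `kLanes`, `clamp8` and `check_clamp8` are hypothetical,
and the element type is assumed to be double. It is unlikely to trigger the
miscompile on its own, since that seems to depend on the surrounding code; it
only pins down the result the source expression requires for a NaN input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

constexpr std::size_t kLanes = 8;  // fixed trip count, as described above

// Element-wise clamp written exactly like the expression in this report.
// Every ordered comparison involving NaN is false, so a NaN in a[i] falls
// through both conditions and is stored to c[i] unchanged.
void clamp8(const double* a, const double* min_vec, const double* max_vec,
            double* c) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        c[i] = a[i] < min_vec[i] ? min_vec[i]
                                 : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
    }
}

// Hypothetical self-check: clamps to [0, 1] and verifies that NaN passes
// through instead of being replaced by the upper bound.
void check_clamp8() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double a[kLanes] = {-5.0, 0.5, 5.0, nan, 0.0, 1.0, -1.0, 2.0};
    double lo[kLanes], hi[kLanes], c[kLanes];
    for (std::size_t i = 0; i < kLanes; ++i) {
        lo[i] = 0.0;
        hi[i] = 1.0;
    }
    clamp8(a, lo, hi, c);
    assert(c[0] == 0.0);       // below the range: clamped to min
    assert(c[1] == 0.5);       // inside the range: unchanged
    assert(c[2] == 1.0);       // above the range: clamped to max
    assert(std::isnan(c[3]));  // NaN input must stay NaN
}
```

Calling `check_clamp8()` from a small driver gives a quick way to test a given
compiler and flag combination for the expected NaN behaviour.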

The generated ASM code has this (using objdump):
    c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
   8e970:       20 00 fe cb     lfd     f31,32(r30)
   8e974:       00 f8 9c ff     fcmpu   cr7,f28,f31
   8e978:       0c 00 9c 41     blt     cr7,8e984 
   8e97c:       40 00 fe cb     lfd     f31,64(r30)
   8e980:       40 fc fc f3     xsmincdp vs31,vs28,vs31
   8e984:       28 00 9e cb     lfd     f28,40(r30)


So I assume f28/vs28 contains a[i] and vs31 contains max_vec[i], which means the
generated instruction has the form `xsmincdp max_vec,a,max_vec`. When a is NaN,
this returns max_vec. The source code, however, must return a, because the
condition `a[i] > max_vec[i]` evaluates to false whenever a NaN is involved.
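
The operand-order problem can be modelled in plain C++. The sketch below
assumes, consistent with the observation above, that `xsmincdp xt,xa,xb`
behaves like `(xa < xb) ? xa : xb`, so that xb is the result whenever either
input is NaN; exception flags for signaling NaNs are ignored. All function
names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Simplified model of `xsmincdp xt,xa,xb` (assumption: xa < xb ? xa : xb).
double xsmincdp_model(double xa, double xb) { return xa < xb ? xa : xb; }

// Inner select of the clamp exactly as written in the source.
double source_select(double a, double max_v) { return a > max_v ? max_v : a; }

// Operand order seen in the GCC 8.3.0 output: xsmincdp max_vec,a,max_vec.
double emitted_select(double a, double max_v) {
    return xsmincdp_model(a, max_v);
}

// Swapped order, xsmincdp t,max_vec,a, i.e. (max_v < a) ? max_v : a,
// which is the same comparison as in the source.
double swapped_select(double a, double max_v) {
    return xsmincdp_model(max_v, a);
}

// Hypothetical self-check comparing the three variants.
void check_operand_order() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    // For ordinary values all variants agree.
    assert(source_select(0.5, 1.0) == 0.5);
    assert(emitted_select(0.5, 1.0) == 0.5);
    assert(swapped_select(2.0, 1.0) == 1.0);
    // With a NaN input only the swapped order matches the source.
    assert(std::isnan(source_select(nan, 1.0)));
    assert(emitted_select(nan, 1.0) == 1.0);
    assert(std::isnan(swapped_select(nan, 1.0)));
}
```

Under this model the instruction itself is usable for the source expression,
but only when a is passed as the second source operand.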

Reproducing this is tricky, as it depends on many conditions. From my
observations, I assume some register pressure is required, and possibly also
another function calling the same code, so there may be side effects from
there. Using GCC 10.2.0 I wasn't able to reproduce this, as the codegen is
slightly different: it seemingly notices that max_vec contains the same value
for all i and reuses a single register:
     324:       00 70 1f fc     fcmpu   cr0,f31,f14
     328:       90 f8 a0 fe     fmr     f21,f31
     32c:       08 00 81 41     bgt     334
     330:       40 74 be f2     xsmincdp vs21,vs30,vs14
     334:       00 78 1f fc     fcmpu   cr0,f31,f15
     338:       90 f8 c0 fe     fmr     f22,f31
     33c:       08 00 81 41     bgt     344
     340:       40 7c de f2     xsmincdp vs22,vs30,vs15

I'm attaching some source code which can be compiled using PyTorch 1.7.0, and 2
examples of preprocessed code which yield the above when compiled using
`g++ -mcpu=power9 -g -std=gnu++14 -O3`.
