https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140
            Bug ID: 98140
           Summary: Reused register by xsmincdp leads to wrong NaN
                    propagation on Power9
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 49679
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49679&action=edit
(preprocessed) source code to reproduce issue

Summary: xsmincdp instructions are generated in a form like
`xsmincdp b,a,b` for code that looks like `(a > b) ? b : a`

I was debugging an issue in PyTorch
(https://github.com/pytorch/pytorch/issues/48591) where I encountered the
following problem. A clamp function is used which looks like this:

    c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);

This sits in very complex code with multiple levels of C++ templates,
lambdas and such, which combines manually unrolled loops with
unroll-friendly loops (i.e. the above is called in a loop with a fixed trip
count of 8).

The generated assembly contains this (using objdump):

    c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
       8e970:   20 00 fe cb     lfd      f31,32(r30)
       8e974:   00 f8 9c ff     fcmpu    cr7,f28,f31
       8e978:   0c 00 9c 41     blt      cr7,8e984
       8e97c:   40 00 fe cb     lfd      f31,64(r30)
       8e980:   40 fc fc f3     xsmincdp vs31,vs28,vs31
       8e984:   28 00 9e cb     lfd      f28,40(r30)

So I assume f28/vs28 contains a[i] and vs31 contains max_vec[i], which
means the generated instruction looks like `xsmincdp max_vec,a,max_vec`.
When a[i] is NaN this returns max_vec[i]. In the source code, however,
a[i] should be returned, because `a[i] > max_vec[i]` evaluates to false
whenever a NaN is involved.

Reproducing this is tricky, as it depends on many conditions. From my
observations I assume that some register pressure is required, and possibly
another function that also calls that code, so there may be side effects
from there.
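
To make the operand-order problem described above concrete, here is a small C++ model. It is only a sketch under stated assumptions: `min_type_c` is a hypothetical helper that assumes `xsmincdp xt,xa,xb` behaves like `(xa < xb) ? xa : xb`, i.e. the second source operand is returned whenever either input is NaN, which is consistent with the behaviour observed in this report. `clamp_upper` and `check_nan_cases` are likewise illustrative names, not anything from PyTorch or GCC.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Inner part of the clamp from the source. Any comparison with NaN is false,
// so a NaN in `a` falls through to the else branch and is returned.
double clamp_upper(double a, double max_val) {
    return a > max_val ? max_val : a;
}

// Hypothetical model of `xsmincdp xt,xa,xb` (an assumption for illustration):
// returns xb whenever the comparison is false, which includes the case where
// either operand is NaN.
double min_type_c(double xa, double xb) {
    return xa < xb ? xa : xb;
}

// Checks the NaN cases discussed in the report.
void check_nan_cases() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double max_val = 1.0;

    // Source semantics: a NaN input propagates to the result.
    assert(std::isnan(clamp_upper(nan, max_val)));
    // Operand order from the 8.3.0 listing (xa = a, xb = max_vec): NaN is lost.
    assert(min_type_c(nan, max_val) == max_val);
    // Swapped operands (xa = max_vec, xb = a): NaN propagates as in the source.
    assert(std::isnan(min_type_c(max_val, nan)));
}
```

Under this model both operand orders give the same result for ordinary numbers and differ only when a NaN is involved, which would explain why the problem only shows up as wrong NaN propagation.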

Using GCC 10.2.0 I wasn't able to reproduce this, as the codegen is
slightly different. Seemingly it notices that max_vec contains the same
value for all i and reuses a single register:

     324:   00 70 1f fc     fcmpu    cr0,f31,f14
     328:   90 f8 a0 fe     fmr      f21,f31
     32c:   08 00 81 41     bgt      334
     330:   40 74 be f2     xsmincdp vs21,vs30,vs14
     334:   00 78 1f fc     fcmpu    cr0,f31,f15
     338:   90 f8 c0 fe     fmr      f22,f31
     33c:   08 00 81 41     bgt      344
     340:   40 7c de f2     xsmincdp vs22,vs30,vs15

I'm attaching some source code which can be compiled using PyTorch 1.7.0,
and 2 examples of preprocessed code which yield the above when compiled
using `g++ -mcpu=power9 -g -std=gnu++14 -O3`
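
For checking a given compiler build, a self-check along the following lines may be useful. This is a sketch, not the attached reproducer: `clamp8`, `clamp8_propagates_nan`, `kTripCount` and the input values are made up for illustration, and a standalone loop like this is not expected to trigger the bad codegen by itself, since the original case appears to depend on register pressure and the surrounding code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Fixed trip count, as in the loop described in the report.
constexpr int kTripCount = 8;

// Hypothetical stand-in for the clamp loop; the expression is the one quoted
// in the report.
void clamp8(const double* a, const double* min_vec, const double* max_vec,
            double* c) {
    for (int i = 0; i < kTripCount; ++i) {
        c[i] = a[i] < min_vec[i] ? min_vec[i]
                                 : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
    }
}

// Returns true if every NaN input yields a NaN output and no non-NaN input
// yields a NaN output.
bool clamp8_propagates_nan() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    double a[kTripCount], min_vec[kTripCount], max_vec[kTripCount], c[kTripCount];
    for (int i = 0; i < kTripCount; ++i) {
        a[i] = (i % 2 == 0) ? nan : 0.5 * i;  // NaN at even indices
        min_vec[i] = -1.0;
        max_vec[i] = 1.0;
    }
    clamp8(a, min_vec, max_vec, c);
    for (int i = 0; i < kTripCount; ++i) {
        if (std::isnan(a[i]) != std::isnan(c[i])) return false;
    }
    return true;
}
```

With correct code generation `clamp8_propagates_nan()` returns true. If the operands of the minimum are swapped as in the 8.3.0 listing, the NaN inputs would come back as max_vec[i] and the check would return false.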