[Bug target/113353] New: Wrong rounding in std::nearbyint when vectorized with -funsafe-math-optimizations on PPC

2024-01-12 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113353

Bug ID: 113353
   Summary: Wrong rounding in std::nearbyint when vectorized with
-funsafe-math-optimizations on PPC
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

According to the standard `std::nearbyint` and `std::rint` are supposed to give
the same results. However the former does not tie to even in halfway cases when
used in a tight loop.

The following code basically rounding "-8.5" produces `-9` when compiled with
`-funsafe-math-optimizations` but (as actually correct) `-8` when compiled
without it or when using `std::rint` instead. (Additional flags: `-O3
-mcpu=power9`)

This is at least unexpected given that
> The only difference between std::nearbyint and std::rint is that 
> std::nearbyint never raises FE_INEXACT. 

which in this case does not hold.

#include 
#include 
#include 

extern "C" void kernel(const double* in_ptr,
   double* out_ptr,
   const long ks)
{
for(long i=0; i in(16, -8.5);
decltype(in) out(in.size());
kernel(in.data(), out.data(), in.size());
for(const auto t: out)
std::cout << t << ", ";
std::cout << '\n';
}

Is this an intended effect of -funsafe-math-optimizations? I'd expect to at
least be able to workaround this with -frounding-math or so.

[Bug c++/112513] Misoptimization of argument

2023-11-14 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

--- Comment #4 from Alexander Grund  ---
Thank you, I replaced that by

unsigned unused;
__cpuid(1, cpuid1.val, unused, unused, unused);

and it works in the setup I have.

[Bug c++/112513] Misoptimization of argument

2023-11-14 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

Alexander Grund  changed:

   What|Removed |Added

 Resolution|INVALID |FIXED

--- Comment #2 from Alexander Grund  ---
I'm having trouble understanding the exact syntax of `asm`. It looks like
`a(...)` in the input/output list refers to read/write of the EAX register and
"b" for EBX etc., doesn't it?
But in the clobber list I'd need to use `"eax"`, not a..d. I haven't been able
to confirm this via the docs
(https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers).
I.e. I would have expected to be able to use the same "a".."d" there too.

Anyway if I understand you correctly the correct instruction would be: `asm
volatile("cpuid" : "=a" (cpuid1.val) : "a" (1) : "ebx", "ecx", "edx",
"memory");`

Did I got that right?

[Bug c++/112513] New: Misoptimization of argument

2023-11-13 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

Bug ID: 112513
   Summary: Misoptimization of argument
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 56569
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56569=edit
Preprocessed source

In the NVIDIA NCCL library (https://github.com/NVIDIA/nccl) I came across a
SIGSEGV in __strncmp_sse42 that happens "sometimes" when compiled with GCC 12
in -O2 mode and higher but don't happen in lower modes or in GCC 11

The original stacktrace looks like this:
#0  0x2bd82e3a in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x2aab18d83a6e in xmlGetAttrIndex (index=,
attrName=0x2aab18e2820c "familyid", node=0x2aae9c108160) at graph/xml.h:67
#2  xmlSetAttrInt (value=143, attrName=0x2aab18e2820c "familyid",
node=0x2aae9c108160) at graph/xml.h:167
#3  ncclTopoGetXmlFromCpu (cpuNode=cpuNode@entry=0x2aae9c108160,
xml=xml@entry=0x2aae9c0d1f20) at graph/xml.cc:436

Moving the `strncmp(key, attrName, MAX_STR_LEN) == 0` out into a separate
function to see the arguments in the debugger shows this backtrace:
#0  0x2bd83c00 in __strncmp_sse42 () from /lib64/libc.so.6
#1  0x2aab18d75a9f in cmpFromXml (attrName=0x89300800 , key=0x2aaeac107eb0 "numaid") at graph/xml.h:65
#2  xmlGetAttrIndex (index=, attrName=0x89300800 , node=0x2aaeac107db0) at
graph/xml.h:73
#3  xmlSetAttrInt (node=node@entry=0x2aaeac107db0,
attrName=attrName@entry=0x89300800 , value=143) at graph/xml.h:174
#4  0x2aab18d77de4 in ncclTopoGetXmlFromCpu
(cpuNode=cpuNode@entry=0x2aaeac107db0, xml=xml@entry=0x2aaeac0d1b70) at
graph/xml.cc:437

So it looks like the `attrName` parameter gets corrupted somehow. The callsite
of `xmlSetAttrInt` is `NCCLCHECK(xmlSetAttrInt(cpuNode, "familyid",
familyId));`, so that parameter is a string constant already used earlier by
`NCCLCHECK(xmlGetAttrIndex(cpuNode, "familyid", ));`

I suspect the `index` parameter to be involved.
Many modifications cause the bug to disappear, such as removing the `NCCLCHECK`
macro (basically an `if(error) return error;`-wrapper) or adding
fprintf-statements into xmlGetAttrIndex or cmpFromXml

The compile command is `g++ -fPIC -fvisibility=hidden -std=c++11 -O2 -g -ggdb3
-c graph/xml.cc`, the preprocessed source. Needs minimization but as it only
happens when compiled into a library used by a python package from a script I
don't know how. So I hope that there will be something obvious for someone
familiar with the optimization in GCC

[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl

2023-11-09 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443

--- Comment #3 from Alexander Grund  ---
> I can confirm that the suggested patch can be applied to 12.2.0 and fixes
> the issue I observed

Also tested 12.1, 12.3, 13.1, 13.2 with this patch and it works (as expected)
too

[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl

2023-11-09 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443

--- Comment #2 from Alexander Grund  ---
I can confirm that the suggested patch can be applied to 12.2.0 and fixes the
issue I observed

[Bug tree-optimization/112443] New: Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl

2023-11-08 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443

Bug ID: 112443
   Summary: Misoptimization of _mm256_blendv_epi8 intrinsic on
avx512bw+avx512vl
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 56533
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56533=edit
Reproducer code extracted from actual source

I came around some piece of code in PyTorch using AVX2 intrinsics that is
misoptimized producing wrong results, when compiled for newer CPUS.
In particular I was able to reproduce this with `-mavx512bw -mavx512vl -O2`

We usually compile with `-march=native` which on the Sapphire Rapids system
enables the above AVX512 flags, but so does `-march=cannonlake` and above.

The piece of code in question is a call to `_mm256_blendv_epi8(a, b, mask)`
that seemingly produces inverted semantics, i.e. I have a mask with all bits
set and it returns a and for a mask with all bits unset it returns b.

It is also a bit complicated to reproduce as it seems to require hiding some
details behind a lambda called through `std::function`.
In the attached example a zero and one vector is created once and copied into
the lambda where it is reused for potentially many iterations (removing the
loop also reproduces the issue)
Either of the following actions causes the bug to disappear:
- Removing either of the 2 `-mavx512` flags
- Reducing to `-O1` or lower
- Moving the zero_vec inside the lambda (moving one_vec makes no difference)
- Not calling through std::function (either run the lambda directly or pass
through as a template param instead of std::function)
- `-DREGEN_MASK` to create a new mask through a (superflous)
`_mm256_cmpeq_epi8` against all 1 bits

Reproducing:
g++ -std=c++17 -mavx512bw -mavx512vl -O2 bug.cpp && ./a.out

Expected output (last line, first line shows the inverted semantic):
vec[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1]

Actual output:
vec[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255]

[Bug tree-optimization/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3

2023-11-06 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370

--- Comment #3 from Alexander Grund  ---
Created attachment 56513
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56513=edit
INVALID reduced example (after running cvise on the former example)

[Bug tree-optimization/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3

2023-11-06 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370

--- Comment #2 from Alexander Grund  ---
FWIW: I tried to run cvise on this however it created an invalid example where
indeed a non-heap pointer would be freed. I'll attach it anyway for reference
as it took hours to run the reduce

[Bug c++/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3

2023-11-03 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370

--- Comment #1 from Alexander Grund  ---
Created attachment 56503
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56503=edit
Preprocessed source

[Bug c++/112370] New: -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3

2023-11-03 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370

Bug ID: 112370
   Summary: -Wfree-nonheap-object in std::vector dtor on
sapphirerapids with -O3
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

I have some code (from PTorch) which shows a bogus -Wfree-nonheap-object
warning when compiled with `-O3 -march=sapphirerapids`:

In file included from
/software/GCCcore/12.2.0/include/c++/12.2.0/x86_64-pc-linux-gnu/bits/c++allocator.h:33,
 from
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/allocator.h:46,
 from /software/GCCcore/12.2.0/include/c++/12.2.0/memory:64,
 from
/src/third_party/ideep/mkl-dnn/src/utils/compatible.hpp:23,
 from
/src/third_party/ideep/mkl-dnn/src/backend/dnnl/dnnl_backend.cpp:19:
In member function 'void std::__new_allocator<_Tp>::deallocate(_Tp*, size_type)
[with _Tp = long int]',
inlined from 'static void std::allocator_traits
>::deallocate(allocator_type&, pointer, size_type) [with _Tp = long int]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/alloc_traits.h:496:23,
inlined from 'void std::_Vector_base<_Tp, _Alloc>::_M_deallocate(pointer,
std::size_t) [with _Tp = long int; _Alloc = std::allocator]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:387:19,
inlined from 'std::_Vector_base<_Tp, _Alloc>::~_Vector_base() [with _Tp =
long int; _Alloc = std::allocator]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:366:15,
inlined from 'std::vector<_Tp, _Alloc>::~vector() [with _Tp = long int;
_Alloc = std::allocator]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:733:7,
inlined from 'virtual void
dnnl::graph::impl::dnnl_impl::bn_folding_t::execute(const dnnl::stream&, const
std::unordered_map&) const' at
/src/third_party/ideep/mkl-dnn/src/backend/dnnl/op_executable.hpp:1060:63:
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/new_allocator.h:158:26: error:
'void operator delete(void*, std::size_t)' called on pointer '' with
nonzero offset [1, 9223372036854775800] [-Werror=free-nonheap-object]
  158 | _GLIBCXX_OPERATOR_DELETE(_GLIBCXX_SIZED_DEALLOC(__p, __n));
  | ~^~~~
In member function '_Tp* std::__new_allocator<_Tp>::allocate(size_type, const
void*) [with _Tp = long int]',
inlined from 'static _Tp* std::allocator_traits
>::allocate(allocator_type&, size_type) [with _Tp = long int]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/alloc_traits.h:464:28,
inlined from 'std::_Vector_base<_Tp, _Alloc>::pointer
std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = long int;
_Alloc = std::allocator]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:378:33,
inlined from 'void std::vector<_Tp,
_Alloc>::_M_range_initialize(_ForwardIterator, _ForwardIterator,
std::forward_iterator_tag) [with _ForwardIterator = const long int*; _Tp = long
int; _Alloc = std::allocator]' at
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:1687:25,
inlined from 'std::vector<_Tp, _Alloc>::vector(_InputIterator,
_InputIterator, const allocator_type&) [with _InputIterator = const long int*;
 = void; _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:706:23,
inlined from 'dnnl::memory::dims dnnl::memory::desc::dims() const' at
/src/third_party/ideep/mkl-dnn/third_party/oneDNN/include/oneapi/dnnl/dnnl.hpp:2677:66,
inlined from 'virtual void
dnnl::graph::impl::dnnl_impl::bn_folding_t::execute(const dnnl::stream&, const
std::unordered_map&) const' at
/src/third_party/ideep/mkl-dnn/src/backend/dnnl/op_executable.hpp:1060:63:
/software/GCCcore/12.2.0/include/c++/12.2.0/bits/new_allocator.h:137:48: note:
returned from 'void* operator new(std::size_t)'
  137 | return static_cast<_Tp*>(_GLIBCXX_OPERATOR_NEW(__n *
sizeof(_Tp)));
  |  ~~^~~

Command used: g++ -O3 -march=sapphirerapids -Wall -Werror -std=gnu++17 -c -o
a.o pre.cpp

Preprocessed source is attached (rather large, sorry)

The bug does not show when using another optimization level or arch. E.g.
`-march=cooperlake` works.

[Bug c++/112301] [12/13/14 regression] Double destruction of returned object when exiting the scope causes an exception which gets rethrown since r12-6333-gb10e031458d541

2023-11-03 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112301

--- Comment #4 from Alexander Grund  ---
Thank you! Will there be backports for 12.x?

The current patch is conflicting due to "c++: enable NRVO from inner block
[PR51571]" and while it seems to be easy enough to backport both patches in
order I'm not sure about possible side effects due to that especially how some
additionally introduced NRVOs can affect interoperability of programs/libraries
compiled before and after that change.

[Bug c++/112301] New: Double destruction of returned object when exiting the scope causes an exception which gets rethrown

2023-10-30 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112301

Bug ID: 112301
   Summary: Double destruction of returned object when exiting the
scope causes an exception which gets rethrown
   Product: gcc
   Version: 12.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 56476
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56476=edit
More complete example with logging pointers

I debugged a heap corruption I traced back to a use-after-free caused by an
extra destructor call.

I suspect the cause could be the fix for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33799 where a throwing destructor
led to a missing destructor call.
It could be similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12751 which
also had an extra destructor call generated for an already destructed instance.

Minimized code sample:

#include 
#include 

int num = 0;
struct ptr{
ptr(){
++num;
}
ptr(ptr&&){
++num;
}
~ptr(){
assert(num-- > 0);
}
};

struct ThrowOnExit{
~ThrowOnExit() noexcept(false){
throw std::runtime_error("");
}
};

ptr foo(ptr x){
try{
ThrowOnExit _;
return x;
}catch (const std::exception&) {
throw;
}
}

void wrapper(){
try{
foo(ptr{});
}catch(const std::exception&){}
}

int main(){
wrapper();
}


The assertion fails, although it should not. Logging the constructions and
destructions and removing the assert gives me this:

construct 0x7ffd4538088e
move construct 0x7ffd4538088f
free 0x7ffd4538088f
free 0x7ffd4538088f
free 0x7ffd4538088e

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #11 from Alexander Grund  ---
Some more experiments with GCC 10.3, OpenBLAS 0.3.15 and FlexiBLAS 3.0.4:

Baseline: Broken at -O1, working at -Og

I got it to break with "-Og -fmove-loop-invariants".
Then it worked again by adding "-fstack-protector-all". But that is seemingly
not advisable:
https://developers.redhat.com/blog/2020/05/22/stack-clash-mitigation-in-gcc-part-3

Hence the current workaround is to use "-O2 -fno-move-loop-invariants"

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #10 from Alexander Grund  ---
(In reply to Peter Bergner from comment #2)
> The failure with GCC 7 and later coincides with the PPC port starting to
> default to LRA instead of reload.

Is there a compiler flag that can switch the default back as a workaround?

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-08 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #7 from Alexander Grund  ---
Hi,
it's more than 1 year later now. Peter seemingly has a simple reproducer.
Is there anything new on this? Any patch to fix that or at least anything to
try or a workaround like disabling a specific optimization causing this?

Best Regards

[Bug ipa/54569] Compiling code with -O3 results to segfault in MAME/MESS binary

2021-07-15 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54569

Alexander Grund  changed:

   What|Removed |Added

 CC||alexander.grund@tu-dresden.
   ||de

--- Comment #3 from Alexander Grund  ---
Created attachment 51154
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51154=edit
preprocessed TBB source

Might have a related issue: https://github.com/oneapi-src/oneTBB/issues/489

The test segfaults when compiled with -O3 but works when compiled with
-fno-ipa-cp-clone

UBSAN reports "runtime error: execution reached an unreachable program point",
but only for the crashing version and Valgrind reports an invalid read, also
only for the crashing version.

Most notably the crash disappears when the direct call to the template function
is replaced by getting a pointer to it first and calling the function through
that on the next line, which shouldn't make any difference.

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-16 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #13 from Alexander Grund  ---
> But what you can see is that the resulting pointer is used for the 
> initialization and not the placement address as literally written in source.

So I assume it was supposed to be "Y::Y (D_6557, 1);" ?

> I'm not sure how one can solve this issue with using placement new
> but are unions not sufficiently restricted so that copy assignment
> should work (and activate the appropriate union member)?  Thus
> 
>   slot->mutable_value = pair(k, v);

The problem is not the copy, the problem is that the value may contain any kind
of data, think e.g. a pair of strings. And at the initial point (i.e. first
emplace) the slot is a casted pointer into uninitialized data. I.e. the above
would be an assignment into an object which does not exist. And (especially)
for such non-trivial types this would break.

I think it will work for trivial types though, although it is UB due to
lifetime rules: You can't use an object (here: assign to) which has not started
its lifetime yet.

However e.g. pair has custom copy and regular constructors so I think it will
run into the issue you mentioned: The ctor will access the object via the
this-pointer and not via the full union-thing and hence might misoptimise later

This would mean that in conclusion the use case of putting std::pairs in an
union and accessing them via aliasing is unsupported by (at least) GCC. Is that
correct?

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-16 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

Alexander Grund  changed:

   What|Removed |Added

Version|8.4.0   |8.5.0
  Known to fail||8.3.0, 8.4.0, 8.5.0
  Known to work||9.1.0

--- Comment #10 from Alexander Grund  ---
> My suspicion is, that GCC loads the value of slot2 before constructing the 
> object

I have just confirmed that: memsetting the whole region with a magic value
shows exactly that: Where a new value should be created and then returned, then
the memset-value is returned instead. When the map already has an entry for the
key (i.e. no new value created) then the existing value (i.e. the correct one)
is returned.

Question now: Is the in-place construction UB and just happens to "normally"
work or is that an optimizer bug that was fixed or gone latent? And how to
proceed to further narrow this down?

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-16 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #9 from Alexander Grund  ---
> Note that when the union type is visible in the access path then GCC allows 
> punning without further restrictions. Thus the accesses as written above are 
> OK.

Now I have to ask again for clarification because this seemingly contradicts
your previous statement. So let me try to state the previous question again:

Given a pointer `slot_type* slot` (where slot_type is the union as defined
before) is it legal to access `slot->value.first`, `slot->mutable_value.first`,
`slot->key` interchangeably?

In all 3 accesses the union is visible in the access path (if I understood that
correctly) so the type punning should be ok, shouldn't it?

> the circumstances have been so there's incentive to do an invalid 
> transform... 

So is this indeed a GCC bug which may be fixed or gone latent?

Otherwise, maybe the construction is the problem? What Abseil basically does is
allocating a (suitably aligned) buffer of chars and in-place constructing those
slot_type unions/pairs there as needed (i.e. similar to std::vector)
After abstractions what basically happens is:

// alloc buffer done, then:
slot_type* slot_buffer = reinterpret_cast(buffer);

// on emplace:
size_t x = ...;
slot_type* slot = slot_buffer + x;
new(slot) slot_type;
new(>mutable_value) pair(k, v);
slot_type* slot2 = slot_buffer + x; // same as before, actually done through
the iterator
idx_vec[i] = slot2->value.second;


My suspicion is, that GCC loads the value of slot2 before constructing the
object, at least sometimes. Printing the values in the vector often shows a run
of zeroes before some other (potentially correct) values. I'd guess the correct
values happen, when no insertion took place, i.e. the construction was done
already in a prior loop iteration.

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-15 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #6 from Alexander Grund  ---
Oh and for completeness: The same applies to the following union, doesn't it?
I.e. given this:

struct TF_TString_Raw {
  uint8_t raw[24];
};
struct TF_TString_Small {
  uint8_t size;
  char str[23];
};
struct TF_TString_Large { 
  size_t size;
  size_t cap;
  char *ptr;
};
union{
TF_TString_Raw raw;
TF_TString_Small smll;
TF_TString_Large large;
} x;  

it is not allowed to access x.raw.raw[0] AND then x.large.size (not even a
common initial subsequence) but also not x.raw.raw[0] AND then x.small.size
(maybe a common sequence, not sure about the array, but not implemented in GCC)

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-15 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #5 from Alexander Grund  ---
So am I right assuming that the following is basically UB as per GCC (although
it should work as per the standard)?

template 
union slot_type {
  map_slot_type() {}
  ~map_slot_type() = delete;
  using value_type = std::pair;
  using mutable_value_type = std::pair;

  value_type value;
  mutable_value_type mutable_value;
  K key;
};

--> I.e. given a pointer `slot_type* slot` it _should_ have been possible to
access `slot->value.first`, `slot->mutable_value.first`, `slot->key`
interchangeably but in fact it is not, i.e. not implemented in GCC.

I'm asking because it "may" work, and e.g. seemingly does in GCC 9+, but
yeah... If that is indeed unsupported I'll open a bug report against the repo
using this. Note that this effectively disallows having "flat" maps that return
`std::pair` via their iterators but have support for moving items
effectively (i.e. via `std::pair` pointers)

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-15 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #3 from Alexander Grund  ---
You are right, it actually seems to be the combination of those to, so -O2
-fno-strict-aliasing and -O2 -fno-tree-vrp both make it work.

The layout-compatible refers to the "common initial sequence" that is allowed
to be inspected for inactive union members, see the link.

I also tested GCC 8.5.0 but there it is not fixed. (BTW: I checked the Github
mirror which doesn't list 8.5.0) I'm aware that it is no longer supported but I
too think that it is a latent bug or maybe just outright unsupported/UB

> If you have ways to pinpoint the function where things go wrong then it might 
> be possible to spot a wrong transform there

In the provided source the bug happens in this 2 lines:
auto it = uniq.emplace(Tin(i), j);
idx_vec(i) = it.first->second;

Tin and idx_vec can be replaced by std::vector and std::vector
respectively and the bug still occurs. However there are quite some
abstractions especially in the first line including a conversion from their
custom string type to their custom string_view

I know it has got not main, because it is part of the tensorflow shared library
loaded into the Python process and run from there. I wasn't able to reduce this
further as mentioned as almost any change leads to this bug disappearing.

I see that the tree-vrp has had many string-related changes which may handle
their custom string better, see
https://github.com/tensorflow/tensorflow/blob/c405c59212c764485818fad8b34294c0b553e6fb/tensorflow/core/platform/ctstring_internal.h#L118-L121
However they are dangerously close to UB in many places, e.g.
https://github.com/tensorflow/tensorflow/blob/c405c59212c764485818fad8b34294c0b553e6fb/tensorflow/core/platform/ctstring_internal.h#L184-L186
(the code executed here) accesses an array of one union member and a variable
of another.

I'm also not sure if allocating a chunk of memory and in-place construction of
an union containing a const member is actually allowed.

I just want to know for sure, that either their code is non-compliant or there
is a bug in the compiler. I can do experiments with the source but lack the
knowledge to trace it through the compiler.

See also my issue in the TF repo:
https://github.com/tensorflow/tensorflow/issues/47179

[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-14 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

--- Comment #1 from Alexander Grund  ---
Created attachment 51006
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51006=edit
compressed preprocessed source

[Bug tree-optimization/101061] New: tree-vrp misoptimization on skylake+ using union-based aliasing

2021-06-14 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061

Bug ID: 101061
   Summary: tree-vrp misoptimization on skylake+ using union-based
aliasing
   Product: gcc
   Version: 8.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

I see an algorithm returning wrong values which happens when using `-ftree-vrp`
(or -O2) and `-march=skylake`. This happens for 8.3.0 and 8.4.0 but not with
9.1 or any later version, but I haven't found a particular fix for this and am
unsure it is actually fixed or just doesn't occur due to unrelated instruction
reordering. Especially as it also disappears when using `-march=skylake
-mtune=broadwell`

The code core code is a loop:
  for (Eigen::Index i = 0, j = 0; i < N; ++i) {
auto it = uniq.emplace(Tin(i), j);
idx_vec(i) = it.first->second;
if (it.second) {
  ++j;
}
  }

where `uniq` is Abseils flat_hash_map over , Tin a
string_view-Tensor and idx_vec is an Eigen Tensor, but could as well be a
std::vector or even a raw-pointer (both tested)
I also tested to assign `it.first->second` into 2 arrays and only the first
would be wrong, so even when swapping them, then again the other (now first)
will be wrong.
Wrong values include zeros (most often) as well as seemingly fully random
values including very large ones.

The exact same algorithm (templated) works for other all types instead of
string_view, e.g. int.

I (think I) traced it down due to the flat-hash-map (i.e. unordered_map with a
contiguous storage) which at its core uses the following union:

template 
union map_slot_type {
  map_slot_type() {}
  ~map_slot_type() = delete;
  using value_type = std::pair;
  using mutable_value_type = std::pair;

  value_type value;
  mutable_value_type mutable_value;
  K key;
};


Over various templates when emplacing/constructing a map entry it calls:
`alloc->construct(>mutable_value, std::forward(args)...)`

The access (i.e. what the iterator at `it.first` will return when dereferenced)
 is done via `value_type& element(slot_type* slot) { return slot->value; }`

--> Construction is done via placement-new of the non-const pair member of the
union while access happens through the const pair member of the union.

According to https://timsong-cpp.github.io/cppwp/n3337/class.mem#19 (9.2.19)
this is allowed, because those types are layout-compatible. I'd suspect GCC 8
misses this (sometimes?).
When doing the construction via `>value` all works, in this case.

I'll attach the preprocessed source, although it is large. All attempts of mine
to reproduce this in a minimized example failed as even minor changes made it
disappear.

[Bug target/100703] __vector_pair and __vector_quad cannot be passed by reference

2021-06-03 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703

--- Comment #3 from Alexander Grund  ---
I found that this was fixed in 10.3 and 11.1 by
https://github.com/gcc-mirror/gcc/commit/e2882e76089cecdc268d0835c54cabfa80b5b0be

So yes only happens in 10.2. Thanks for checking that!

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-05-28 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #1 from Alexander Grund  ---
Confirmed to also break with GCC 7.3, 8.2, 8.3 but works with 6.3, 6.4, 6.5

[Bug fortran/100799] New: Stackoverflow in optimized code on PPC

2021-05-27 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Bug ID: 100799
   Summary: Stackoverflow in optimized code on PPC
   Product: gcc
   Version: 10.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 50879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50879=edit
Disassembly of dbgebal_ in debug and release modes

Quick summary of the use case: When using FlexiBLAS with OpenBLAS I noticed
corruption of the parameters passed to OpenBLAS functions. FlexiBLAS basically
provides a BLAS interface where each function is a stub that forwards the
arguments to a real BLAS lib, like OpenBLAS

Example:
void FC_GLOBAL(dgebal,DGEBAL)(char* job, blasint* n, double* a, blasint* lda,
blasint* ilo, blasint* ihi, double* scale, blasint* info)
{
void (*fn) (void* job, void* n, void* a, void* lda, void* ilo, void*
ihi, void* scale, void* info);

fn = current_backend->lapack.dgebal.f77_blas_function; 

fn((void*) job, (void*) n, (void*) a, (void*) lda, (void*) ilo,
(void*) ihi, (void*) scale, (void*) info); 

return;
}
void dgebal(char* job, blasint* n, double* a, blasint* lda, blasint* ilo,
blasint* ihi, double* scale, blasint* info)
__attribute__((alias(MTS(FC_GLOBAL(dgebal,DGEBAL);

Due to the alias and the real BLAS lib being loader after FlexiBLAS also the
calls from an OpenBLAS function to another OpenBLAS function get routed through
FlexiBLAS.

Now I noticed that the parameter "N" at
https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgeev.f#L369
gets messed up during the call at
https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgeev.f#L363
which I traced to FlexiBLAS pushing the register that holds it, calling the
OpenBLAS DGEBAL and restoring it afterwards but the stack entry where it came
from gets changed by DGEBAL

So the actual Bug here is that GCC generates code for DGEBAL which uses a write
outside of the allocated stack.

The dissassembly of the dgebal_ function shows "stdur1,-368(r1)" in the
prologue and "std r25,440(r1)" later, which is the instruction that
overwrites the saved register from the calling function.
As far as I can tell an offset of 440 onto r1, which is bigger than the 368
"allocated" by the stdu is invalid.
The line reported by GDB for the overwriting instruction is
https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgebal.f#L328

The command used to compile the file is: gfortran -fno-math-errno -Wall
-frecursive -fno-optimize-sibling-calls -m64 -fopenmp -fPIC -O2 -fno-fast-math
-mcpu=power9 -mtune=power9  -DUSE_OPENMP -fopenmp -fno-optimize-sibling-calls
-g  -c -o dgebal.o dgebal.f

Replacing the "O2" by "Og" changes the prologue to "stdur1,-336(r1)" and
the max offset used for std on r1 is 328. Using this works with FlexiBLAS,
hence I suspect an optimization issue which leads to more spills but doesn't
update the stack size.

Reproduced with GCC 10.2.0, 10.3.0, 11.1.0

[Bug target/100706] Invalid instructions in plt calls on PPC

2021-05-21 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706

--- Comment #3 from Alexander Grund  ---
Ok, tested a few different bintuils versions. Bug occurs with 2.35 and 2.35.1
and seems to be fixed in 2.35.2

So I guess this can be closed

[Bug target/100706] Invalid instructions in plt calls on PPC

2021-05-21 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706

--- Comment #2 from Alexander Grund  ---
Thanks for the reply. Although this was hard to test (need to interfere with
TFs build process which must use the gold linker) I managed to link that one
library manually with bfd and it did work. Hence yes, that is a gold linker
issue.

Any suggestions on how to proceed?

[Bug target/100706] New: Invalid instructions in plt calls on PPC

2021-05-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706

Bug ID: 100706
   Summary: Invalid instructions in plt calls on PPC
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

This turns up when compiling TensorFlow 2.5.0 on ppc9le. I wasn't yet able to
reduce the code to reproduce it with a small example, so only got the
following:

The relevant code is this:

mutex.h:
struct MuData {
  void* space[2];
};

struct mutex{
  mutex();
  MuData m_;
};

--

mutex.cc:
#include "mutex.h"
#include "nsync_mu.h"

static inline nsync::nsync_mu *mu_cast(MuData *mu) {
  return reinterpret_cast(mu);
}

mutex::mutex() { nsync::nsync_mu_init(mu_cast(_)); }

--


This is compiled into a shared library with the following compile command: `gcc
-U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -g0
'-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections -O3
'-mcpu=native' -fno-math-errno -fPIC '-std=c++14'
-fno-canonical-system-headers`
Linker flags used later are `-Wl,-no-as-needed -Wl,-z,relro,-z,now
-Wl,--gc-sections`

The nsync library is from https://github.com/google/nsync (version 1.24)

An instance of the mutex class is than later created by an application linked
against this shared library during init (i.e. in a static global context) and
that raises an SIGILL.

Doing that manually I see a plt_call being made which looks like this:

Dump of assembler code for function
.plt_call._ZN5nsync13nsync_mu_initEPNS_11nsync_mu_s_E:
   0x20074900 <+0>: std r2,24(r1)
   0x20074904 <+4>: ld  r12,-32584(r2)
   0x20074908 <+8>: mtctr   r12
   0x2007490c <+12>:bctr
   0x20074910 <+16>:.long 0x0
   0x20074914 <+20>:.long 0x0
   0x20074918 <+24>:.long 0x0
   0x2007491c <+28>:.long 0x0

This looks good and works (as mentioned my reduced example didn't run into the
problem)

The TF compiled version of this plt call looks like this:

Dump of assembler code for function
0003.plt_call._ZN5nsync13nsync_mu_initEPNS_11nsync_mu_s_E:
   0x200b19851660 <+0>: std r2,24(r1)
   0x200b19851664 <+4>: nop
=> 0x200b19851668 <+8>: .long 0x41004ce
   0x200b1985166c <+12>:lfdpf12,264(0)
   0x200b19851670 <+16>:mtctr   r12
   0x200b19851674 <+20>:bctr
   0x200b19851678 <+24>:.long 0x0
   0x200b1985167c <+28>:.long 0x0

As you can see something inserted a strange value into the asm code. I'm not
sure if gcc or the linker (gold linker used here) creates those plt calls, but
something is obviously wrong here.

I also checked another call into the nsync library: nsync::nsync_mu_lock
The plt call looks very similar:
Dump of assembler code for function
0003.plt_call._ZN5nsync13nsync_mu_lockEPNS_11nsync_mu_s_E:
   0x200b19851680 <+0>: std r2,24(r1)
   0x200b19851684 <+4>: nop
   0x200b19851688 <+8>: .long 0x41004ce
   0x200b1985168c <+12>:lfdpf12,240(0)
   0x200b19851690 <+16>:mtctr   r12
   0x200b19851694 <+20>:bctr
   0x200b19851698 <+24>:.long 0x0
   0x200b1985169c <+28>:.long 0x0

As you can see the constant inserted and everything else but the lfdp offset is
the same.

I hope that is enough to find the problem. I'm happy to provide more insight or
do some further tests if required. However I'm not a PPC expert so I have no
idea where to go further with that.

[Bug target/100703] __vector_pair and __vector_quad cannot be passed by reference

2021-05-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703

--- Comment #1 from Alexander Grund  ---
It goes further: Even the usual conversion rules for pointer types don't apply:

void foo(__vector_pair*){}
void bar(const __vector_pair*){}
int main(){
  __vector_pair p;
  foo(p); // works
  bar(p); // fails
}

[Bug c++/100703] New: __vector_pair and __vector_quad cannot be passed by reference

2021-05-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703

Bug ID: 100703
   Summary: __vector_pair and __vector_quad cannot be passed by
reference
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

The PowerPC intrinsic types __vector_pair and __vector_quad cannot be passed by
reference but only by pointer.

Reproducer code:

void ploadRhsMMA(__vector_pair&){}
int main(){
  __vector_pair p;
  ploadRhsMMA(p);
}

This creates an error "invalid conversion from type »* __vector_pair«"

This error looks wrong to me: Why wouldn't I be able to pass a type by
reference when I can pass it by pointer? GCC can handle both the same so it is
at least inconsistent.

Furthermore this breaks the commonly used assumption, that the following
functions can take any argument:

template
void foo(T&&);
template
void bar(const T&);

This now breaks for those intrinsics and a workaround is painful as everything
needs to be covnerted to pointers when those are part of an overload set (for
example)

[Bug c++/100677] New: False positive unused-but-set-parameter warning when using VSX intrinsics

2021-05-19 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100677

Bug ID: 100677
   Summary: False positive unused-but-set-parameter warning when
using VSX intrinsics
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

On some code like the below the unused-but-set-parameter warning is incorrectly
triggered:

class Vec256 {
  union {
struct {
  vfloat32 _vec0;
  vfloat32 _vec1;
};
struct {
  vbool32 _vecb0;
  vbool32 _vecb1;
};

  } __attribute__((__may_alias__));
  Vec256(float scalar)
  : _vec0{vec_splats(scalar)}, _vec1{vec_splats(scalar)} {}
};

GCC seems to think "scalar" is unused even though it is passed to "vec_splats"
which is defined as "#define vec_splats __builtin_vec_splats"

This seems to also apply to other intrinsics, see e.g.
https://qlf-gitlab.inria.fr/guenneba/eigenci-junit-test/-/blob/85330d21b58bccbf95c90e874db2a0f6775ff63b/Eigen/src/Core/arch/AltiVec/PacketMath.h#L271-275

[Bug c++/100448] New: internal compiler error: Segmentation fault

2021-05-06 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100448

Bug ID: 100448
   Summary: internal compiler error: Segmentation fault
   Product: gcc
   Version: 10.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 50764
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50764=edit
compressed preprocessed source

The attached file triggers a segfault.
It might be possible that the way it was created was erroneous but the compiler
should segfault for any input hence the report

Reproduced with 10.2.0, 10.3.0, 9.3.0 on 2 systems.

[Bug middle-end/98561] -Wstringop-overflow triggered when memcpy to single char and writing to differently sized array members

2021-01-06 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98561

--- Comment #2 from Alexander Grund  ---
I did some work to reduce this further: https://godbolt.org/z/sezTPs

For some reason it seems to be related to std::array. So may I suggest to
include the above (or the original reproducer) in the test cases to avoid
regressions?

[Bug middle-end/98561] New: -Wstringop-overflow triggered when memcpy to single char and writing to differently sized array members

2021-01-06 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98561

Bug ID: 98561
   Summary: -Wstringop-overflow triggered when memcpy to single
char and writing to differently sized array members
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 49899
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49899=edit
Test case to trigger the warning in -O3

This looks very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87296
but occurs in GCC 10.0 to 10.2, not before.

The test case triggers the bug when compiled with `gcc -O3` but NOT with `gcc
-O2 -ftree-vectorize` as in the above bug. Instead it is `gcc -O2 -fpeel-loops`

See reproducer on godbolt: https://godbolt.org/z/Ef1Pne

Interesting here is that it does not involve actual OOB access through VLA or
size-0/size-1 hacks and is extremely sensitive. The following changes make the
warning disappear:
- changing the array sizes (some combinations only)
- hiding the memcpy
- using C-Arrays
- using a single array
- removing either for-loop

Also the size of the 2nd array matters for the amount of warnings shown for
that line.

It seems it is already fixed on trunk but all mentioned issues so far involved
a VLA or OOB access so this is not a direct duplicate.

Of course the test code is a reduced MWE but you can guess the purpose.

[Bug target/98140] Reused register by xsmincdp leads to wrong NaN propagation on Power9

2020-12-04 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140

--- Comment #1 from Alexander Grund  ---
It looks like this was fixed in 10.1 by this commit
https://github.com/gcc-mirror/gcc/commit/37e0df8a9be5a8232f4ccb73cdadb02121ba523f

However the codegen looks worse:

 390:   20 00 9e c3 lfs f28,32(r30)
 394:   00 e0 9e ff fcmpu   cr7,f30,f28
 398:   10 00 9c 41 blt cr7,3a8
 39c:   40 00 9e c3 lfs f28,64(r30)
 3a0:   58 e0 1e f0 xscmpgtdp vs0,vs30,vs28
 3a4:   30 e0 9e f3 xxsel   vs28,vs30,vs28,vs0

When switching an (a > b) ? b : a to an xsmincdp, the arguments must be swapped
to honor the NaN rules. This would allow this to be used for the `HONOR_NANS
(compare_mode)` case. However it still ignores signed zeros. Maybe xsmindp
would be a better fit as it preserves the signed zeros. Only downside I see is
that it converts sNan to qNan which may be an issue.

[Bug target/98140] New: Reused register by xsmincdp leads to wrong NaN propagation on Power9

2020-12-04 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140

Bug ID: 98140
   Summary: Reused register by xsmincdp leads to wrong NaN
propagation on Power9
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 49679
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49679=edit
(preprocessed) source code to reproduce issue

Summary: xsmincdp instructions are generated in a form like `xsmincdp b,a,b`
for code that looks like `(a > b) ? b : a`

I was debugging an issue in PyTorch
(https://github.com/pytorch/pytorch/issues/48591) where I encountered the
following problem:

A clamp function is used which looks like this:

c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] :
a[i]);

This is used in very complex code using multiple levels of C++ templates,
lambdas and such and uses a combination of manually unrolled loops and
unroll-friendly loops (i.e. the above is called in a loop with a fixed trip
count of 8)

The generated ASM code has this (using objdump):
c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] :
a[i]);
   8e970:   20 00 fe cb lfd f31,32(r30)
   8e974:   00 f8 9c ff fcmpu   cr7,f28,f31
   8e978:   0c 00 9c 41 blt cr7,8e984 
   8e97c:   40 00 fe cb lfd f31,64(r30)
   8e980:   40 fc fc f3 xsmincdp vs31,vs28,vs31
   8e984:   28 00 9e cb lfd f28,40(r30)


So I assume f28/vs28 contains a[i] and vs31 contains max_vec[i], so the
instruction generated looks like `xsmincdp max_vec,a,max_vec` which on NaN will
return max_vec. However in the source code a should be returned due to the
condition evaluating to false when a NaN is involved.

Reproducing this is tricky, as it depends on many conditions. From my
observations I assume some register pressure is required and even some other
function also calling that code, so maybe some side effects from there. Using
GCC 10.2.0 I wasn't able to reproduce this as the codegen is slightly
different: Seemingly it notices that max_vec contains the same value for all i
and reuses a single register:
 324:   00 70 1f fc fcmpu   cr0,f31,f14
 328:   90 f8 a0 fe fmr f21,f31
 32c:   08 00 81 41 bgt 334
 330:   40 74 be f2 xsmincdp vs21,vs30,vs14
 334:   00 78 1f fc fcmpu   cr0,f31,f15
 338:   90 f8 c0 fe fmr f22,f31
 33c:   08 00 81 41 bgt 344
 340:   40 7c de f2 xsmincdp vs22,vs30,vs15

I'm attaching some source code which can be compiled using PyTorch 1.7.0 and 2
examples of preprocessed code which yield the above when compiled using `g++ 
-mcpu=power9 -g -std=gnu++14 -O3`