[Bug target/113353] New: Wrong rounding in std::nearbyint when vectorized with -funsafe-math-optimizations on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113353 Bug ID: 113353 Summary: Wrong rounding in std::nearbyint when vectorized with -funsafe-math-optimizations on PPC Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- According to the standard `std::nearbyint` and `std::rint` are supposed to give the same results. However the former does not tie to even in halfway cases when used in a tight loop. The following code basically rounding "-8.5" produces `-9` when compiled with `-funsafe-math-optimizations` but (as actually correct) `-8` when compiled without it or when using `std::rint` instead. (Additional flags: `-O3 -mcpu=power9`) This is at least unexpected given that > The only difference between std::nearbyint and std::rint is that > std::nearbyint never raises FE_INEXACT. which in this case does not hold. #include #include #include extern "C" void kernel(const double* in_ptr, double* out_ptr, const long ks) { for(long i=0; i in(16, -8.5); decltype(in) out(in.size()); kernel(in.data(), out.data(), in.size()); for(const auto t: out) std::cout << t << ", "; std::cout << '\n'; } Is this an intended effect of -funsafe-math-optimizations? I'd expect to at least be able to workaround this with -frounding-math or so.
[Bug c++/112513] Misoptimization of argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513 --- Comment #4 from Alexander Grund --- Thank you, I replaced that by unsigned unused; __cpuid(1, cpuid1.val, unused, unused, unused); and it works in the setup I have.
[Bug c++/112513] Misoptimization of argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513 Alexander Grund changed: What|Removed |Added Resolution|INVALID |FIXED --- Comment #2 from Alexander Grund --- I'm having trouble understanding the exact syntax of `asm`. It looks like `a(...)` in the input/output list refers to read/write of the EAX register and "b" for EBX etc., doesn't it? But in the clobber list I'd need to use `"eax"`, not a..d. I haven't been able to confirm this via the docs (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers). I.e. I would have expected to be able to use the same "a".."d" there too. Anyway if I understand you correctly the correct instruction would be: `asm volatile("cpuid" : "=a" (cpuid1.val) : "a" (1) : "ebx", "ecx", "edx", "memory");` Did I got that right?
[Bug c++/112513] New: Misoptimization of argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513 Bug ID: 112513 Summary: Misoptimization of argument Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 56569 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56569=edit Preprocessed source In the NVIDIA NCCL library (https://github.com/NVIDIA/nccl) I came across a SIGSEGV in __strncmp_sse42 that happens "sometimes" when compiled with GCC 12 in -O2 mode and higher but don't happen in lower modes or in GCC 11 The original stacktrace looks like this: #0 0x2bd82e3a in __strcmp_sse42 () from /lib64/libc.so.6 #1 0x2aab18d83a6e in xmlGetAttrIndex (index=, attrName=0x2aab18e2820c "familyid", node=0x2aae9c108160) at graph/xml.h:67 #2 xmlSetAttrInt (value=143, attrName=0x2aab18e2820c "familyid", node=0x2aae9c108160) at graph/xml.h:167 #3 ncclTopoGetXmlFromCpu (cpuNode=cpuNode@entry=0x2aae9c108160, xml=xml@entry=0x2aae9c0d1f20) at graph/xml.cc:436 Moving the `strncmp(key, attrName, MAX_STR_LEN) == 0` out into a separate function to see the arguments in the debugger shows this backtrace: #0 0x2bd83c00 in __strncmp_sse42 () from /lib64/libc.so.6 #1 0x2aab18d75a9f in cmpFromXml (attrName=0x89300800 , key=0x2aaeac107eb0 "numaid") at graph/xml.h:65 #2 xmlGetAttrIndex (index=, attrName=0x89300800 , node=0x2aaeac107db0) at graph/xml.h:73 #3 xmlSetAttrInt (node=node@entry=0x2aaeac107db0, attrName=attrName@entry=0x89300800 , value=143) at graph/xml.h:174 #4 0x2aab18d77de4 in ncclTopoGetXmlFromCpu (cpuNode=cpuNode@entry=0x2aaeac107db0, xml=xml@entry=0x2aaeac0d1b70) at graph/xml.cc:437 So it looks like the `attrName` parameter gets corrupted somehow. The callsite of `xmlSetAttrInt` is `NCCLCHECK(xmlSetAttrInt(cpuNode, "familyid", familyId));`, so that parameter is a string constant already used earlier by `NCCLCHECK(xmlGetAttrIndex(cpuNode, "familyid", ));` I suspect the `index` parameter to be involved. Many modifications cause the bug to disappear, such as removing the `NCCLCHECK` macro (basically an `if(error) return error;`-wrapper) or adding fprintf-statements into xmlGetAttrIndex or cmpFromXml The compile command is `g++ -fPIC -fvisibility=hidden -std=c++11 -O2 -g -ggdb3 -c graph/xml.cc`, the preprocessed source. Needs minimization but as it only happens when compiled into a library used by a python package from a script I don't know how. So I hope that there will be something obvious for someone familiar with the optimization in GCC
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #3 from Alexander Grund --- > I can confirm that the suggested patch can be applied to 12.2.0 and fixes > the issue I observed Also tested 12.1, 12.3, 13.1, 13.2 with this patch and it works (as expected) too
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #2 from Alexander Grund --- I can confirm that the suggested patch can be applied to 12.2.0 and fixes the issue I observed
[Bug tree-optimization/112443] New: Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 Bug ID: 112443 Summary: Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 56533 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56533=edit Reproducer code extracted from actual source I came around some piece of code in PyTorch using AVX2 intrinsics that is misoptimized producing wrong results, when compiled for newer CPUS. In particular I was able to reproduce this with `-mavx512bw -mavx512vl -O2` We usually compile with `-march=native` which on the Sapphire Rapids system enables the above AVX512 flags, but so does `-march=cannonlake` and above. The piece of code in question is a call to `_mm256_blendv_epi8(a, b, mask)` that seemingly produces inverted semantics, i.e. I have a mask with all bits set and it returns a and for a mask with all bits unset it returns b. It is also a bit complicated to reproduce as it seems to require hiding some details behind a lambda called through `std::function`. In the attached example a zero and one vector is created once and copied into the lambda where it is reused for potentially many iterations (removing the loop also reproduces the issue) Either of the following actions causes the bug to disappear: - Removing either of the 2 `-mavx512` flags - Reducing to `-O1` or lower - Moving the zero_vec inside the lambda (moving one_vec makes no difference) - Not calling through std::function (either run the lambda directly or pass through as a template param instead of std::function) - `-DREGEN_MASK` to create a new mask through a (superflous) `_mm256_cmpeq_epi8` against all 1 bits Reproducing: g++ -std=c++17 -mavx512bw -mavx512vl -O2 bug.cpp && ./a.out Expected output (last line, first line shows the inverted semantic): vec[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] Actual output: vec[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255]
[Bug tree-optimization/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370 --- Comment #3 from Alexander Grund --- Created attachment 56513 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56513=edit INVALID reduced example (after running cvise on the former example)
[Bug tree-optimization/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370 --- Comment #2 from Alexander Grund --- FWIW: I tried to run cvise on this however it created an invalid example where indeed a non-heap pointer would be freed. I'll attach it anyway for reference as it took hours to run the reduce
[Bug c++/112370] -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370 --- Comment #1 from Alexander Grund --- Created attachment 56503 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56503=edit Preprocessed source
[Bug c++/112370] New: -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112370 Bug ID: 112370 Summary: -Wfree-nonheap-object in std::vector dtor on sapphirerapids with -O3 Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- I have some code (from PTorch) which shows a bogus -Wfree-nonheap-object warning when compiled with `-O3 -march=sapphirerapids`: In file included from /software/GCCcore/12.2.0/include/c++/12.2.0/x86_64-pc-linux-gnu/bits/c++allocator.h:33, from /software/GCCcore/12.2.0/include/c++/12.2.0/bits/allocator.h:46, from /software/GCCcore/12.2.0/include/c++/12.2.0/memory:64, from /src/third_party/ideep/mkl-dnn/src/utils/compatible.hpp:23, from /src/third_party/ideep/mkl-dnn/src/backend/dnnl/dnnl_backend.cpp:19: In member function 'void std::__new_allocator<_Tp>::deallocate(_Tp*, size_type) [with _Tp = long int]', inlined from 'static void std::allocator_traits >::deallocate(allocator_type&, pointer, size_type) [with _Tp = long int]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/alloc_traits.h:496:23, inlined from 'void std::_Vector_base<_Tp, _Alloc>::_M_deallocate(pointer, std::size_t) [with _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:387:19, inlined from 'std::_Vector_base<_Tp, _Alloc>::~_Vector_base() [with _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:366:15, inlined from 'std::vector<_Tp, _Alloc>::~vector() [with _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:733:7, inlined from 'virtual void dnnl::graph::impl::dnnl_impl::bn_folding_t::execute(const dnnl::stream&, const std::unordered_map&) const' at /src/third_party/ideep/mkl-dnn/src/backend/dnnl/op_executable.hpp:1060:63: /software/GCCcore/12.2.0/include/c++/12.2.0/bits/new_allocator.h:158:26: error: 'void operator delete(void*, std::size_t)' called on pointer '' with nonzero offset [1, 9223372036854775800] [-Werror=free-nonheap-object] 158 | _GLIBCXX_OPERATOR_DELETE(_GLIBCXX_SIZED_DEALLOC(__p, __n)); | ~^~~~ In member function '_Tp* std::__new_allocator<_Tp>::allocate(size_type, const void*) [with _Tp = long int]', inlined from 'static _Tp* std::allocator_traits >::allocate(allocator_type&, size_type) [with _Tp = long int]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/alloc_traits.h:464:28, inlined from 'std::_Vector_base<_Tp, _Alloc>::pointer std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:378:33, inlined from 'void std::vector<_Tp, _Alloc>::_M_range_initialize(_ForwardIterator, _ForwardIterator, std::forward_iterator_tag) [with _ForwardIterator = const long int*; _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:1687:25, inlined from 'std::vector<_Tp, _Alloc>::vector(_InputIterator, _InputIterator, const allocator_type&) [with _InputIterator = const long int*; = void; _Tp = long int; _Alloc = std::allocator]' at /software/GCCcore/12.2.0/include/c++/12.2.0/bits/stl_vector.h:706:23, inlined from 'dnnl::memory::dims dnnl::memory::desc::dims() const' at /src/third_party/ideep/mkl-dnn/third_party/oneDNN/include/oneapi/dnnl/dnnl.hpp:2677:66, inlined from 'virtual void dnnl::graph::impl::dnnl_impl::bn_folding_t::execute(const dnnl::stream&, const std::unordered_map&) const' at /src/third_party/ideep/mkl-dnn/src/backend/dnnl/op_executable.hpp:1060:63: /software/GCCcore/12.2.0/include/c++/12.2.0/bits/new_allocator.h:137:48: note: returned from 'void* operator new(std::size_t)' 137 | return static_cast<_Tp*>(_GLIBCXX_OPERATOR_NEW(__n * sizeof(_Tp))); | ~~^~~ Command used: g++ -O3 -march=sapphirerapids -Wall -Werror -std=gnu++17 -c -o a.o pre.cpp Preprocessed source is attached (rather large, sorry) The bug does not show when using another optimization level or arch. E.g. `-march=cooperlake` works.
[Bug c++/112301] [12/13/14 regression] Double destruction of returned object when exiting the scope causes an exception which gets rethrown since r12-6333-gb10e031458d541
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112301 --- Comment #4 from Alexander Grund --- Thank you! Will there be backports for 12.x? The current patch is conflicting due to "c++: enable NRVO from inner block [PR51571]" and while it seems to be easy enough to backport both patches in order I'm not sure about possible side effects due to that especially how some additionally introduced NRVOs can affect interoperability of programs/libraries compiled before and after that change.
[Bug c++/112301] New: Double destruction of returned object when exiting the scope causes an exception which gets rethrown
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112301 Bug ID: 112301 Summary: Double destruction of returned object when exiting the scope causes an exception which gets rethrown Product: gcc Version: 12.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 56476 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56476=edit More complete example with logging pointers I debugged a heap corruption I traced back to a use-after-free caused by an extra destructor call. I suspect the cause could be the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33799 where a throwing destructor led to a missing destructor call. It could be similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12751 which also had an extra destructor call generated for an already destructed instance. Minimized code sample: #include #include int num = 0; struct ptr{ ptr(){ ++num; } ptr(ptr&&){ ++num; } ~ptr(){ assert(num-- > 0); } }; struct ThrowOnExit{ ~ThrowOnExit() noexcept(false){ throw std::runtime_error(""); } }; ptr foo(ptr x){ try{ ThrowOnExit _; return x; }catch (const std::exception&) { throw; } } void wrapper(){ try{ foo(ptr{}); }catch(const std::exception&){} } int main(){ wrapper(); } The assertion fails, although it should not. Logging the constructions and destructions and removing the assert gives me this: construct 0x7ffd4538088e move construct 0x7ffd4538088f free 0x7ffd4538088f free 0x7ffd4538088f free 0x7ffd4538088e
[Bug target/100799] Stackoverflow in optimized code on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 --- Comment #11 from Alexander Grund --- Some more experiments with GCC 10.3, OpenBLAS 0.3.15 and FlexiBLAS 3.0.4: Baseline: Broken at -O1, working at -Og I got it to break with "-Og -fmove-loop-invariants". Then it worked again by adding "-fstack-protector-all". But that is seemingly not advisable: https://developers.redhat.com/blog/2020/05/22/stack-clash-mitigation-in-gcc-part-3 Hence the current workaround is to use "-O2 -fno-move-loop-invariants"
[Bug target/100799] Stackoverflow in optimized code on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 --- Comment #10 from Alexander Grund --- (In reply to Peter Bergner from comment #2) > The failure with GCC 7 and later coincides with the PPC port starting to > default to LRA instead of reload. Is there a compiler flag that can switch the default back as a workaround?
[Bug target/100799] Stackoverflow in optimized code on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 --- Comment #7 from Alexander Grund --- Hi, it's more than 1 year later now. Peter seemingly has a simple reproducer. Is there anything new on this? Any patch to fix that or at least anything to try or a workaround like disabling a specific optimization causing this? Best Regards
[Bug ipa/54569] Compiling code with -O3 results to segfault in MAME/MESS binary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54569 Alexander Grund changed: What|Removed |Added CC||alexander.grund@tu-dresden. ||de --- Comment #3 from Alexander Grund --- Created attachment 51154 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51154=edit preprocessed TBB source Might have a related issue: https://github.com/oneapi-src/oneTBB/issues/489 The test segfaults when compiled with -O3 but works when compiled with -fno-ipa-cp-clone UBSAN reports "runtime error: execution reached an unreachable program point", but only for the crashing version and Valgrind reports an invalid read, also only for the crashing version. Most notably the crash disappears when the direct call to the template function is replaced by getting a pointer to it first and calling the function through that on the next line, which shouldn't make any difference.
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #13 from Alexander Grund --- > But what you can see is that the resulting pointer is used for the > initialization and not the placement address as literally written in source. So I assume it was supposed to be "Y::Y (D_6557, 1);" ? > I'm not sure how one can solve this issue with using placement new > but are unions not sufficiently restricted so that copy assignment > should work (and activate the appropriate union member)? Thus > > slot->mutable_value = pair(k, v); The problem is not the copy, the problem is that the value may contain any kind of data, think e.g. a pair of strings. And at the initial point (i.e. first emplace) the slot is a casted pointer into uninitialized data. I.e. the above would be an assignment into an object which does not exist. And (especially) for such non-trivial types this would break. I think it will work for trivial types though, although it is UB due to lifetime rules: You can't use an object (here: assign to) which has not started its lifetime yet. However e.g. pair has custom copy and regular constructors so I think it will run into the issue you mentioned: The ctor will access the object via the this-pointer and not via the full union-thing and hence might misoptimise later This would mean that in conclusion the use case of putting std::pairs in an union and accessing them via aliasing is unsupported by (at least) GCC. Is that correct?
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 Alexander Grund changed: What|Removed |Added Version|8.4.0 |8.5.0 Known to fail||8.3.0, 8.4.0, 8.5.0 Known to work||9.1.0 --- Comment #10 from Alexander Grund --- > My suspicion is, that GCC loads the value of slot2 before constructing the > object I have just confirmed that: memsetting the whole region with a magic value shows exactly that: Where a new value should be created and then returned, then the memset-value is returned instead. When the map already has an entry for the key (i.e. no new value created) then the existing value (i.e. the correct one) is returned. Question now: Is the in-place construction UB and just happens to "normally" work or is that an optimizer bug that was fixed or gone latent? And how to proceed to further narrow this down?
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #9 from Alexander Grund --- > Note that when the union type is visible in the access path then GCC allows > punning without further restrictions. Thus the accesses as written above are > OK. Now I have to ask again for clarification because this seemingly contradicts your previous statement. So let me try to state the previous question again: Given a pointer `slot_type* slot` (where slot_type is the union as defined before) is it legal to access `slot->value.first`, `slot->mutable_value.first`, `slot->key` interchangeably? In all 3 accesses the union is visible in the access path (if I understood that correctly) so the type punning should be ok, shouldn't it? > the circumstances have been so there's incentive to do an invalid > transform... So is this indeed a GCC bug which may be fixed or gone latent? Otherwise, maybe the construction is the problem? What Abseil basically does is allocating a (suitably aligned) buffer of chars and in-place constructing those slot_type unions/pairs there as needed (i.e. similar to std::vector) After abstractions what basically happens is: // alloc buffer done, then: slot_type* slot_buffer = reinterpret_cast(buffer); // on emplace: size_t x = ...; slot_type* slot = slot_buffer + x; new(slot) slot_type; new(>mutable_value) pair(k, v); slot_type* slot2 = slot_buffer + x; // same as before, actually done through the iterator idx_vec[i] = slot2->value.second; My suspicion is, that GCC loads the value of slot2 before constructing the object, at least sometimes. Printing the values in the vector often shows a run of zeroes before some other (potentially correct) values. I'd guess the correct values happen, when no insertion took place, i.e. the construction was done already in a prior loop iteration.
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #6 from Alexander Grund --- Oh and for completeness: The same applies to the following union, doesn't it? I.e. given this: struct TF_TString_Raw { uint8_t raw[24]; }; struct TF_TString_Small { uint8_t size; char str[23]; }; struct TF_TString_Large { size_t size; size_t cap; char *ptr; }; union{ TF_TString_Raw raw; TF_TString_Small smll; TF_TString_Large large; } x; it is not allowed to access x.raw.raw[0] AND then x.large.size (not even a common initial subsequence) but also not x.raw.raw[0] AND then x.small.size (maybe a common sequence, not sure about the array, but not implemented in GCC)
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #5 from Alexander Grund --- So am I right assuming that the following is basically UB as per GCC (although it should work as per the standard)? template union slot_type { map_slot_type() {} ~map_slot_type() = delete; using value_type = std::pair; using mutable_value_type = std::pair; value_type value; mutable_value_type mutable_value; K key; }; --> I.e. given a pointer `slot_type* slot` it _should_ have been possible to access `slot->value.first`, `slot->mutable_value.first`, `slot->key` interchangeably but in fact it is not, i.e. not implemented in GCC. I'm asking because it "may" work, and e.g. seemingly does in GCC 9+, but yeah... If that is indeed unsupported I'll open a bug report against the repo using this. Note that this effectively disallows having "flat" maps that return `std::pair` via their iterators but have support for moving items effectively (i.e. via `std::pair` pointers)
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #3 from Alexander Grund --- You are right, it actually seems to be the combination of those to, so -O2 -fno-strict-aliasing and -O2 -fno-tree-vrp both make it work. The layout-compatible refers to the "common initial sequence" that is allowed to be inspected for inactive union members, see the link. I also tested GCC 8.5.0 but there it is not fixed. (BTW: I checked the Github mirror which doesn't list 8.5.0) I'm aware that it is no longer supported but I too think that it is a latent bug or maybe just outright unsupported/UB > If you have ways to pinpoint the function where things go wrong then it might > be possible to spot a wrong transform there In the provided source the bug happens in this 2 lines: auto it = uniq.emplace(Tin(i), j); idx_vec(i) = it.first->second; Tin and idx_vec can be replaced by std::vector and std::vector respectively and the bug still occurs. However there are quite some abstractions especially in the first line including a conversion from their custom string type to their custom string_view I know it has got not main, because it is part of the tensorflow shared library loaded into the Python process and run from there. I wasn't able to reduce this further as mentioned as almost any change leads to this bug disappearing. I see that the tree-vrp has had many string-related changes which may handle their custom string better, see https://github.com/tensorflow/tensorflow/blob/c405c59212c764485818fad8b34294c0b553e6fb/tensorflow/core/platform/ctstring_internal.h#L118-L121 However they are dangerously close to UB in many places, e.g. https://github.com/tensorflow/tensorflow/blob/c405c59212c764485818fad8b34294c0b553e6fb/tensorflow/core/platform/ctstring_internal.h#L184-L186 (the code executed here) accesses an array of one union member and a variable of another. I'm also not sure if allocating a chunk of memory and in-place construction of an union containing a const member is actually allowed. I just want to know for sure, that either their code is non-compliant or there is a bug in the compiler. I can do experiments with the source but lack the knowledge to trace it through the compiler. See also my issue in the TF repo: https://github.com/tensorflow/tensorflow/issues/47179
[Bug tree-optimization/101061] tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 --- Comment #1 from Alexander Grund --- Created attachment 51006 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51006=edit compressed preprocessed source
[Bug tree-optimization/101061] New: tree-vrp misoptimization on skylake+ using union-based aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101061 Bug ID: 101061 Summary: tree-vrp misoptimization on skylake+ using union-based aliasing Product: gcc Version: 8.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- I see an algorithm returning wrong values which happens when using `-ftree-vrp` (or -O2) and `-march=skylake`. This happens for 8.3.0 and 8.4.0 but not with 9.1 or any later version, but I haven't found a particular fix for this and am unsure it is actually fixed or just doesn't occur due to unrelated instruction reordering. Especially as it also disappears when using `-march=skylake -mtune=broadwell` The code core code is a loop: for (Eigen::Index i = 0, j = 0; i < N; ++i) { auto it = uniq.emplace(Tin(i), j); idx_vec(i) = it.first->second; if (it.second) { ++j; } } where `uniq` is Abseils flat_hash_map over , Tin a string_view-Tensor and idx_vec is an Eigen Tensor, but could as well be a std::vector or even a raw-pointer (both tested) I also tested to assign `it.first->second` into 2 arrays and only the first would be wrong, so even when swapping them, then again the other (now first) will be wrong. Wrong values include zeros (most often) as well as seemingly fully random values including very large ones. The exact same algorithm (templated) works for other all types instead of string_view, e.g. int. I (think I) traced it down due to the flat-hash-map (i.e. unordered_map with a contiguous storage) which at its core uses the following union: template union map_slot_type { map_slot_type() {} ~map_slot_type() = delete; using value_type = std::pair; using mutable_value_type = std::pair; value_type value; mutable_value_type mutable_value; K key; }; Over various templates when emplacing/constructing a map entry it calls: `alloc->construct(>mutable_value, std::forward(args)...)` The access (i.e. what the iterator at `it.first` will return when dereferenced) is done via `value_type& element(slot_type* slot) { return slot->value; }` --> Construction is done via placement-new of the non-const pair member of the union while access happens through the const pair member of the union. According to https://timsong-cpp.github.io/cppwp/n3337/class.mem#19 (9.2.19) this is allowed, because those types are layout-compatible. I'd suspect GCC 8 misses this (sometimes?). When doing the construction via `>value` all works, in this case. I'll attach the preprocessed source, although it is large. All attempts of mine to reproduce this in a minimized example failed as even minor changes made it disappear.
[Bug target/100703] __vector_pair and __vector_quad cannot be passed by reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703 --- Comment #3 from Alexander Grund --- I found that this was fixed in 10.3 and 11.1 by https://github.com/gcc-mirror/gcc/commit/e2882e76089cecdc268d0835c54cabfa80b5b0be So yes only happens in 10.2. Thanks for checking that!
[Bug target/100799] Stackoverflow in optimized code on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 --- Comment #1 from Alexander Grund --- Confirmed to also break with GCC 7.3, 8.2, 8.3 but works with 6.3, 6.4, 6.5
[Bug fortran/100799] New: Stackoverflow in optimized code on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 Bug ID: 100799 Summary: Stackoverflow in optimized code on PPC Product: gcc Version: 10.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: fortran Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 50879 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50879=edit Disassembly of dbgebal_ in debug and release modes Quick summary of the use case: When using FlexiBLAS with OpenBLAS I noticed corruption of the parameters passed to OpenBLAS functions. FlexiBLAS basically provides a BLAS interface where each function is a stub that forwards the arguments to a real BLAS lib, like OpenBLAS Example: void FC_GLOBAL(dgebal,DGEBAL)(char* job, blasint* n, double* a, blasint* lda, blasint* ilo, blasint* ihi, double* scale, blasint* info) { void (*fn) (void* job, void* n, void* a, void* lda, void* ilo, void* ihi, void* scale, void* info); fn = current_backend->lapack.dgebal.f77_blas_function; fn((void*) job, (void*) n, (void*) a, (void*) lda, (void*) ilo, (void*) ihi, (void*) scale, (void*) info); return; } void dgebal(char* job, blasint* n, double* a, blasint* lda, blasint* ilo, blasint* ihi, double* scale, blasint* info) __attribute__((alias(MTS(FC_GLOBAL(dgebal,DGEBAL); Due to the alias and the real BLAS lib being loader after FlexiBLAS also the calls from an OpenBLAS function to another OpenBLAS function get routed through FlexiBLAS. Now I noticed that the parameter "N" at https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgeev.f#L369 gets messed up during the call at https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgeev.f#L363 which I traced to FlexiBLAS pushing the register that holds it, calling the OpenBLAS DGEBAL and restoring it afterwards but the stack entry where it came from gets changed by DGEBAL So the actual Bug here is that GCC generates code for DGEBAL which uses a write outside of the allocated stack. The dissassembly of the dgebal_ function shows "stdur1,-368(r1)" in the prologue and "std r25,440(r1)" later, which is the instruction that overwrites the saved register from the calling function. As far as I can tell an offset of 440 onto r1, which is bigger than the 368 "allocated" by the stdu is invalid. The line reported by GDB for the overwriting instruction is https://github.com/xianyi/OpenBLAS/blob/v0.3.15/lapack-netlib/SRC/dgebal.f#L328 The command used to compile the file is: gfortran -fno-math-errno -Wall -frecursive -fno-optimize-sibling-calls -m64 -fopenmp -fPIC -O2 -fno-fast-math -mcpu=power9 -mtune=power9 -DUSE_OPENMP -fopenmp -fno-optimize-sibling-calls -g -c -o dgebal.o dgebal.f Replacing the "O2" by "Og" changes the prologue to "stdur1,-336(r1)" and the max offset used for std on r1 is 328. Using this works with FlexiBLAS, hence I suspect an optimization issue which leads to more spills but doesn't update the stack size. Reproduced with GCC 10.2.0, 10.3.0, 11.1.0
[Bug target/100706] Invalid instructions in plt calls on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706 --- Comment #3 from Alexander Grund --- Ok, tested a few different bintuils versions. Bug occurs with 2.35 and 2.35.1 and seems to be fixed in 2.35.2 So I guess this can be closed
[Bug target/100706] Invalid instructions in plt calls on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706 --- Comment #2 from Alexander Grund --- Thanks for the reply. Although this was hard to test (need to interfere with TFs build process which must use the gold linker) I managed to link that one library manually with bfd and it did work. Hence yes, that is a gold linker issue. Any suggestions on how to proceed?
[Bug target/100706] New: Invalid instructions in plt calls on PPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100706 Bug ID: 100706 Summary: Invalid instructions in plt calls on PPC Product: gcc Version: 10.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- This turns up when compiling TensorFlow 2.5.0 on ppc9le. I wasn't yet able to reduce the code to reproduce it with a small example, so only got the following: The relevant code is this: mutex.h: struct MuData { void* space[2]; }; struct mutex{ mutex(); MuData m_; }; -- mutex.cc: #include "mutex.h" #include "nsync_mu.h" static inline nsync::nsync_mu *mu_cast(MuData *mu) { return reinterpret_cast(mu); } mutex::mutex() { nsync::nsync_mu_init(mu_cast(_)); } -- This is compiled into a shared library with the following compile command: `gcc -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -g0 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections -O3 '-mcpu=native' -fno-math-errno -fPIC '-std=c++14' -fno-canonical-system-headers` Linker flags used later are `-Wl,-no-as-needed -Wl,-z,relro,-z,now -Wl,--gc-sections` The nsync library is from https://github.com/google/nsync (version 1.24) An instance of the mutex class is than later created by an application linked against this shared library during init (i.e. in a static global context) and that raises an SIGILL. Doing that manually I see a plt_call being made which looks like this: Dump of assembler code for function .plt_call._ZN5nsync13nsync_mu_initEPNS_11nsync_mu_s_E: 0x20074900 <+0>: std r2,24(r1) 0x20074904 <+4>: ld r12,-32584(r2) 0x20074908 <+8>: mtctr r12 0x2007490c <+12>:bctr 0x20074910 <+16>:.long 0x0 0x20074914 <+20>:.long 0x0 0x20074918 <+24>:.long 0x0 0x2007491c <+28>:.long 0x0 This looks good and works (as mentioned my reduced example didn't run into the problem) The TF compiled version of this plt call looks like this: Dump of assembler code for function 0003.plt_call._ZN5nsync13nsync_mu_initEPNS_11nsync_mu_s_E: 0x200b19851660 <+0>: std r2,24(r1) 0x200b19851664 <+4>: nop => 0x200b19851668 <+8>: .long 0x41004ce 0x200b1985166c <+12>:lfdpf12,264(0) 0x200b19851670 <+16>:mtctr r12 0x200b19851674 <+20>:bctr 0x200b19851678 <+24>:.long 0x0 0x200b1985167c <+28>:.long 0x0 As you can see something inserted a strange value into the asm code. I'm not sure if gcc or the linker (gold linker used here) creates those plt calls, but something is obviously wrong here. I also checked another call into the nsync library: nsync::nsync_mu_lock The plt call looks very similar: Dump of assembler code for function 0003.plt_call._ZN5nsync13nsync_mu_lockEPNS_11nsync_mu_s_E: 0x200b19851680 <+0>: std r2,24(r1) 0x200b19851684 <+4>: nop 0x200b19851688 <+8>: .long 0x41004ce 0x200b1985168c <+12>:lfdpf12,240(0) 0x200b19851690 <+16>:mtctr r12 0x200b19851694 <+20>:bctr 0x200b19851698 <+24>:.long 0x0 0x200b1985169c <+28>:.long 0x0 As you can see the constant inserted and everything else but the lfdp offset is the same. I hope that is enough to find the problem. I'm happy to provide more insight or do some further tests if required. However I'm not a PPC expert so I have no idea where to go further with that.
[Bug target/100703] __vector_pair and __vector_quad cannot be passed by reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703 --- Comment #1 from Alexander Grund --- It goes further: Even the usual conversion rules for pointer types don't apply: void foo(__vector_pair*){} void bar(const __vector_pair*){} int main(){ __vector_pair p; foo(p); // works bar(p); // fails }
[Bug c++/100703] New: __vector_pair and __vector_quad cannot be passed by reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100703 Bug ID: 100703 Summary: __vector_pair and __vector_quad cannot be passed by reference Product: gcc Version: 10.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- The PowerPC intrinsic types __vector_pair and __vector_quad cannot be passed by reference but only by pointer. Reproducer code: void ploadRhsMMA(__vector_pair&){} int main(){ __vector_pair p; ploadRhsMMA(p); } This creates an error "invalid conversion from type »* __vector_pair«" This error looks wrong to me: Why wouldn't I be able to pass a type by reference when I can pass it by pointer? GCC can handle both the same so it is at least inconsistent. Furthermore this breaks the commonly used assumption, that the following functions can take any argument: template void foo(T&&); template void bar(const T&); This now breaks for those intrinsics and a workaround is painful as everything needs to be covnerted to pointers when those are part of an overload set (for example)
[Bug c++/100677] New: False positive unused-but-set-parameter warning when using VSX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100677 Bug ID: 100677 Summary: False positive unused-but-set-parameter warning when using VSX intrinsics Product: gcc Version: 10.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- On some code like the below the unused-but-set-parameter warning is incorrectly triggered: class Vec256 { union { struct { vfloat32 _vec0; vfloat32 _vec1; }; struct { vbool32 _vecb0; vbool32 _vecb1; }; } __attribute__((__may_alias__)); Vec256(float scalar) : _vec0{vec_splats(scalar)}, _vec1{vec_splats(scalar)} {} }; GCC seems to think "scalar" is unused even though it is passed to "vec_splats" which is defined as "#define vec_splats __builtin_vec_splats" This seems to also apply to other intrinsics, see e.g. https://qlf-gitlab.inria.fr/guenneba/eigenci-junit-test/-/blob/85330d21b58bccbf95c90e874db2a0f6775ff63b/Eigen/src/Core/arch/AltiVec/PacketMath.h#L271-275
[Bug c++/100448] New: internal compiler error: Segmentation fault
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100448 Bug ID: 100448 Summary: internal compiler error: Segmentation fault Product: gcc Version: 10.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 50764 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50764=edit compressed preprocessed source The attached file triggers a segfault. It might be possible that the way it was created was erroneous but the compiler should segfault for any input hence the report Reproduced with 10.2.0, 10.3.0, 9.3.0 on 2 systems.
[Bug middle-end/98561] -Wstringop-overflow triggered when memcpy to single char and writing to differently sized array members
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98561 --- Comment #2 from Alexander Grund --- I did some work to reduce this further: https://godbolt.org/z/sezTPs For some reason it seems to be related to std::array. So may I suggest to include the above (or the original reproducer) in the test cases to avoid regressions?
[Bug middle-end/98561] New: -Wstringop-overflow triggered when memcpy to single char and writing to differently sized array members
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98561 Bug ID: 98561 Summary: -Wstringop-overflow triggered when memcpy to single char and writing to differently sized array members Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 49899 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49899=edit Test case to trigger the warning in -O3 This looks very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87296 but occurs in GCC 10.0 to 10.2, not before. The test case triggers the bug when compiled with `gcc -O3` but NOT with `gcc -O2 -ftree-vectorize` as in the above bug. Instead it is `gcc -O2 -fpeel-loops` See reproducer on godbolt: https://godbolt.org/z/Ef1Pne Interesting here is that it does not involve actual OOB access through VLA or size-0/size-1 hacks and is extremely sensitive. The following changes make the warning disappear: - changing the array sizes (some combinations only) - hiding the memcpy - using C-Arrays - using a single array - removing either for-loop Also the size of the 2nd array matters for the amount of warnings shown for that line. It seems it is already fixed on trunk but all mentioned issues so far involved a VLA or OOB access so this is not a direct duplicate. Of course the test code is a reduced MWE but you can guess the purpose.
[Bug target/98140] Reused register by xsmincdp leads to wrong NaN propagation on Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140 --- Comment #1 from Alexander Grund --- It looks like this was fixed in 10.1 by this commit https://github.com/gcc-mirror/gcc/commit/37e0df8a9be5a8232f4ccb73cdadb02121ba523f However the codegen looks worse: 390: 20 00 9e c3 lfs f28,32(r30) 394: 00 e0 9e ff fcmpu cr7,f30,f28 398: 10 00 9c 41 blt cr7,3a8 39c: 40 00 9e c3 lfs f28,64(r30) 3a0: 58 e0 1e f0 xscmpgtdp vs0,vs30,vs28 3a4: 30 e0 9e f3 xxsel vs28,vs30,vs28,vs0 When switching an (a > b) ? b : a to an xsmincdp, the arguments must be swapped to honor the NaN rules. This would allow this to be used for the `HONOR_NANS (compare_mode)` case. However it still ignores signed zeros. Maybe xsmindp would be a better fit as it preserves the signed zeros. Only downside I see is that it converts sNan to qNan which may be an issue.
[Bug target/98140] New: Reused register by xsmincdp leads to wrong NaN propagation on Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98140 Bug ID: 98140 Summary: Reused register by xsmincdp leads to wrong NaN propagation on Power9 Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alexander.gr...@tu-dresden.de Target Milestone: --- Created attachment 49679 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49679=edit (preprocessed) source code to reproduce issue Summary: xsmincdp instructions are generated in a form like `xsmincdp b,a,b` for code that looks like `(a > b) ? b : a` I was debugging an issue in PyTorch (https://github.com/pytorch/pytorch/issues/48591) where I encountered the following problem: A clamp function is used which looks like this: c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]); This is used in very complex code using multiple levels of C++ templates, lambdas and such and uses a combination of manually unrolled loops and unroll-friendly loops (i.e. the above is called in a loop with a fixed trip count of 8) The generated ASM code has this (using objdump): c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]); 8e970: 20 00 fe cb lfd f31,32(r30) 8e974: 00 f8 9c ff fcmpu cr7,f28,f31 8e978: 0c 00 9c 41 blt cr7,8e984 8e97c: 40 00 fe cb lfd f31,64(r30) 8e980: 40 fc fc f3 xsmincdp vs31,vs28,vs31 8e984: 28 00 9e cb lfd f28,40(r30) So I assume f28/vs28 contains a[i] and vs31 contains max_vec[i], so the instruction generated looks like `xsmincdp max_vec,a,max_vec` which on NaN will return max_vec. However in the source code a should be returned due to the condition evaluating to false when a NaN is involved. Reproducing this is tricky, as it depends on many conditions. From my observations I assume some register pressure is required and even some other function also calling that code, so maybe some side effects from there. Using GCC 10.2.0 I wasn't able to reproduce this as the codegen is slightly different: Seemingly it notices that max_vec contains the same value for all i and reuses a single register: 324: 00 70 1f fc fcmpu cr0,f31,f14 328: 90 f8 a0 fe fmr f21,f31 32c: 08 00 81 41 bgt 334 330: 40 74 be f2 xsmincdp vs21,vs30,vs14 334: 00 78 1f fc fcmpu cr0,f31,f15 338: 90 f8 c0 fe fmr f22,f31 33c: 08 00 81 41 bgt 344 340: 40 7c de f2 xsmincdp vs22,vs30,vs15 I'm attaching some source code which can be compiled using PyTorch 1.7.0 and 2 examples of preprocessed code which yield the above when compiled using `g++ -mcpu=power9 -g -std=gnu++14 -O3`