[Bug middle-end/116383] New: Value from __atomic_store not forwarded to non-atomic load at same address
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116383

Bug ID: 116383
Summary: Value from __atomic_store not forwarded to non-atomic load at same address
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

https://godbolt.org/z/1bbjoc87n

int test(int* i, int val) {
    __atomic_store_n(i, val, __ATOMIC_RELAXED);
    return *i;
}

The non-atomic load should be able to directly use the value stored by the atomic store, but instead GCC issues a new load:

        mov     DWORD PTR [rdi], esi
        mov     eax, DWORD PTR [rdi]
        ret

Clang recognizes that the load is unnecessary and propagates the value:

        mov     eax, esi
        mov     dword ptr [rdi], esi
        ret

In addition to simply being an unnecessary load, there is an additional penalty on most CPUs for reading a value that is still in the CPU's store buffer, which it almost certainly would be in this case. And of course this also disables further optimizations, e.g. DSE and value propagation, where the compiler knows something about the value.

void blocking_further_optimizations(int* i) {
    if (test(i, 1) == 0) {
        __builtin_abort();
    }
}

generates the following with GCC:

        mov     DWORD PTR [rdi], 1
        mov     edx, DWORD PTR [rdi]
        test    edx, edx
        je      .L5
        ret
blocking_further_optimizations(int*) [clone .cold]:
.L5:
        push    rax
        call    abort

and this much better output with Clang:

        mov     dword ptr [rdi], 1
        ret

While I'm using a relaxed store here to show that GCC doesn't apply the optimization even in that case, I think the optimization should apply regardless of memory ordering (and Clang seems to agree). Also, while the minimal example code is contrived, there are several real-world use cases where this pattern can come up. I would expect it in cases where there is a single writer thread but many reader threads. The writes and off-thread reads need to use __atomic ops to avoid data races, but on-thread reads should be safe using ordinary loads, and you would want them to be optimized as much as possible.
[Bug c++/114357] Add a way to not call destructors for non-auto decls, like clang's no_destroy attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114357

Mathias Stearn changed:
           What    |Removed |Added
           CC      |        |redbeard0531 at gmail dot com

--- Comment #9 from Mathias Stearn ---
For codebases that always use `quick_exit()` or `_Exit()` to exit, all static destructors are dead code bloat. For constinit function-statics with non-trivial destructors (that will never run) this results in unnecessary overhead each time the function is called: https://godbolt.org/z/jsjTYdMf3

> It changes behavior and even could cause memory leaks if abused.

For static-duration objects there is no such thing as a memory leak. The whole address space is going away. Allowing this on thread-local duration objects is probably a mistake though (with a possible exception for embedded and kernel-level code that does its own thread teardown).
[Bug tree-optimization/59429] Missed optimization opportunity in qsort-style comparison functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59429

--- Comment #16 from Mathias Stearn ---
Trunk still generates different code for all cases (in some cases subtly so) for both aarch64 and x86_64: https://www.godbolt.org/z/1s6sxrMWq

For both arches, it seems like LE and LG generate the best code. On aarch64, they probably all have the same throughput, but EL and EG have a size penalty from an extra instruction. On x86_64, there is much more variety. EL and EG both end up with a branch rather than being branchless, which is probably bad since comparison functions are often called in ways where branches on the result are unpredictable. GE and GL appear to have regressed since this ticket was created. They now do the comparison twice rather than reusing the flags from the first comparison:

comGL(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setl    al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
comGE(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setne   al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
[Bug target/113723] -freorder-blocks-and-partition emits bogus asm directives on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723 --- Comment #1 from Mathias Stearn --- As a workaround sticking [[gnu::optimize("no-reorder-blocks-and-partition")]] on each function that triggers this seems to work. But that is obviously not a scalable solution.
[Bug target/113723] New: -freorder-blocks-and-partition emits bogus asm directives on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723

Bug ID: 113723
Summary: -freorder-blocks-and-partition emits bogus asm directives on aarch64
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

It tries to subtract labels across the function split, which the assembler rejects. At the very least it does this when generating a switch jump table, but there may be other cases. https://www.godbolt.org/z/Ynh1zzY4a

Repro with -O3 -freorder-blocks-and-partition:

[[gnu::cold]] void cold();
template <int> void f();
#define GENERATE_CASE case __COUNTER__: f<__COUNTER__>(); break

void test(int i) {
    switch (i) {
        case 1: cold(); break;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
    }
}

test(int):
        cmp     w0, 18
        bls     .L17
.L1:
        ret
.L17:
        adrp    x1, .L4
        add     x1, x1, :lo12:.L4
        ldrb    w1, [x1, w0, uxtw]
        adr     x0, .Lrtx4
        add     x1, x0, w1, sxtb #2
        br      x1
.Lrtx4:
.L4:
        .byte   (.L14 - .Lrtx4) / 4
        .byte   (.L13 - .Lrtx4) / 4   // <<<< THIS IS THE BAD ONE
        .byte   (.L12 - .Lrtx4) / 4
        ...
test(int) [clone .cold]:
.L13:
        b       cold()

/tmp/cck05x4u.s: Assembler messages:
/tmp/cck05x4u.s:48: Error: invalid operands (.text.unlikely and .text sections) for `-'
Compiler returned: 1

I've repro'd on 11, 12, and trunk.
[Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #5 from Mathias Stearn ---
(In reply to Andrew Pinski from comment #2)
> I should note that on x86, 2 cmov in a row might be an issue and worse than
> branches. There is a cost model and the x86 backend rejects that.
>
> There are some cores where it is worse. I don't know if it applies to recent
> ones though.

Do you know if that applies to any cores that support x86_64? I checked Agner Fog's tables, and only very, very old cores (P4 era) had high reciprocal throughput, but even then it was less than latency. It looks like all AMD cores and Intel cores newer than Ivy Bridge (i.e. everything from the last 10 years) are able to execute multiple CMOVs per cycle (reciprocal throughput < 1). From what I can see, bad CMOV was a particular problem of the Pentium 4 and Prescott cores, and possibly PPro, but I don't see the numbers for it. I don't think any of those cores should have an impact on the default cost model in 2024.
[Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Bug ID: 113682
Summary: Branches in branchless binary search rather than cmov/csel/csinc
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

I've been trying to eliminate unpredictable branches in a hot function where perf counters show a high mispredict percentage. Unfortunately, I haven't been able to find an incantation to get GCC to generate branchless code other than inline asm, which I'd rather avoid. In this case I've even laid out the critical lines so that they exactly match the behavior of the csinc and csel instructions on arm64, but they are still not used.

Somewhat minimized repro:

typedef unsigned long size_t;
struct ITEM { char* addr; size_t len; };
int cmp(ITEM* user, ITEM* tree);

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low;  // csinc
        high = res < 0 ? i : high;    // csel
        if (res == 0)
            return i;
    }
    return -1;
}

On arm64 that generates a conditional branch on res > 0:

        bl      cmp(ITEM*, ITEM*)
        cmp     w0, 0
        bgt     .L15    // does low = i + 1 then loops
        mov     x20, x19
        bne     .L4     // loop

On x86_64 it does similar:

        call    cmp(ITEM*, ITEM*)
        test    eax, eax
        jg      .L16
        jne     .L17

Note that clang produces the desired codegen for both:

arm:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, #0
        csinc   x23, x23, x22, le
        csel    x19, x22, x19, lt
        cbnz    w0, .LBB0_1

x86:
        call    cmp(ITEM*, ITEM*)@PLT
        lea     rcx, [r12 + 1]
        test    eax, eax
        cmovg   r13, rcx
        cmovs   rbx, r12
        jne     .LBB0_1

(Full output for all 4 is available at https://www.godbolt.org/z/aWrKbYPTG. Code snippets are from trunk, but it also repros on 13.2.)

While ideally GCC would generate the branchless output for the supplied code, if there is some (reasonable) incantation that would cause it to produce branchless output, I'd be happy to have that too.
[Bug middle-end/113585] New: Poor codegen turning int compare into -1,0,1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113585

Bug ID: 113585
Summary: Poor codegen turning int compare into -1,0,1
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

https://www.godbolt.org/z/Y31xG7EeT

These two functions should be equivalent, but comp2() produces better code than comp1() on both arm64 and x86_64:

int comp1(int a, int b) { return a == b ? 0 : (a < b ? -1 : 1); }
int comp2(int a, int b) { return a < b ? -1 : (a > b ? 1 : 0); }

arm64:

comp1(int, int):
        cmp     w0, w1
        mov     w0, -1
        csinc   w0, w0, wzr, lt
        csel    w0, w0, wzr, ne
        ret
comp2(int, int):
        cmp     w0, w1
        cset    w0, gt
        csinv   w0, w0, wzr, ge
        ret

x86_64:

comp1(int, int):
        xor     eax, eax
        cmp     edi, esi
        je      .L1
        setge   al
        movzx   eax, al
        lea     eax, [rax-1+rax]
.L1:
        ret
comp2(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, -1
        setg    al
        cmovl   eax, edx
        ret

In the arm case, I suspect that the perf will be equivalent, although comp1 has an extra instruction, so it will pay a size penalty. On x86, comp2 is branchless while comp1 has a branch, which could be problematic if it is not predictable. It seems like there should be a normalization pass to convert these functions (and other equivalent ones) into a single normalized representation so that they get the same codegen.

PS: I tried another equivalent function and it produces even worse codegen on x86_64, comparing the same registers twice:

int comp3(int a, int b) { return a > b ? 1 : (a == b ? 0 : -1); }

comp3(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setne   al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588 --- Comment #5 from Mathias Stearn --- Mea culpa. The difference between boost and std was due to the code to fast-path shared_ptrs that aren't actually shared: https://github.com/gcc-mirror/gcc/blob/be34a8b538c0f04b11a428bd1a9340eb19dec13f/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h#L324-L362. I still think that optimization is a good idea, even if it looks bad in this specific microbenchmark. When that is disabled, they have the same perf, even with the check for single-threaded. That said, I'd still love an opt out. For now, I'll just propose that we add a do-nothing bg thread in our benchmark main() to avoid misleading results.
[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588

--- Comment #4 from Mathias Stearn ---
So even though I already knew in the back of my mind about how this can affect benchmark results, I *still* got burned by it! I was benchmarking a simple hazard pointer implementation against shared_ptrs by having them crab-walk a list of 1000 pointers. This was an admittedly simple and unrealistic benchmark that only ever accessed the objects from a single thread and never had any contention. It was a few hours after getting the initial results that I remembered this issue and went back to add a bg thread.

> This needs numbers, not opinions

While my biggest concern was misleading benchmarks (which I now feel a bit validated by my own mistake :), here are the numbers I see. I added boost::shared_ptr on the assumption (unvalidated) that the primary difference would be that it unconditionally uses atomics.

--------------------------------------------------------------
Benchmark                         Time        CPU   Iterations
--------------------------------------------------------------
BM_scanLinks_HazPtr            6420 ns    6420 ns       108948
BM_scanLinks_BoostSharedPtr   11223 ns   11223 ns        62380
BM_scanLinks_StdSharedPtr      3217 ns    3217 ns       217621
BM_scanLinks_AtomicSharedPtr  28920 ns   28920 ns        24200

And with a bg thread doing nothing but sleeping:

--------------------------------------------------------------
Benchmark                         Time        CPU   Iterations
--------------------------------------------------------------
BM_scanLinks_HazPtr            6887 ns    6887 ns       101445
BM_scanLinks_BoostSharedPtr   11224 ns   11224 ns        62373
BM_scanLinks_StdSharedPtr     14221 ns   14221 ns        49221
BM_scanLinks_AtomicSharedPtr  49574 ns   49573 ns        14126
[Bug other/111976] Large constant zero-valued objects should go in .bss rather than .rodata
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976

--- Comment #1 from Mathias Stearn ---
Just to be clear, I think we should only do this for "large" objects or collections of objects. If you did it for small objects in general, there is a risk of spreading the mutable data that is written to over more pages, so that you end up with more runtime anonymous pages rather than read-only pages backed by the file cache, making things more expensive overall. I think a good rule to prevent this would be to only do it for objects larger than a page, and possibly add page alignment to them. It may also be possible to collect enough smaller objects together to make it worth doing this; not sure how often that occurs in practice though.

Also, at -Os it may make sense to do this for objects of any size, since putting small objects in .bss will still shrink the binary size. Not sure how users of -Os would react to tradeoffs involving runtime memory consumption vs binary size.
[Bug other/111976] New: Large constant zero-valued objects should go in .bss rather than .rodata
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976

Bug ID: 111976
Summary: Large constant zero-valued objects should go in .bss rather than .rodata
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

Right now constant objects always go in .rodata. This is nice because it will give a nice loud error if you ever write to them. However, .rodata takes up space in the binary file and in memory at runtime. If you instead put it in .bss it takes up no space in the binary file and, at least on Linux, it gets backed by a single zero-filled page of physical memory. Ideally there would be something like .robss which gave you the best of both worlds, but this is admittedly niche for all the effort to add a new section like that. I think the best option is to leave it in .rodata for non-optimized builds to catch bugs, but when optimizing, especially with -Os, put it in .bss.

Repro https://www.godbolt.org/z/3rWvTrsTv:

constexpr int GB = 1024*1024*1024;

// Goes in .bss - no space in binary or runtime.
char nonconst_zeros[GB] = {};

// Goes in .rodata - expensive in binary size and runtime memory.
extern const char const_zeros[GB]; // force usage
const char const_zeros[GB] = {};

        .globl  const_zeros
        .section        .rodata
        .align  32
        .type   const_zeros, @object
        .size   const_zeros, 1073741824
const_zeros:
        .zero   1073741824
        .globl  nonconst_zeros
        .bss
        .align  32
        .type   nonconst_zeros, @object
        .size   nonconst_zeros, 1073741824
nonconst_zeros:
        .zero   1073741824
[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611

Mathias Stearn changed:
           What    |Removed |Added
           CC      |        |redbeard0531 at gmail dot com

--- Comment #4 from Mathias Stearn ---
clang has already been using the optimized memcmp code since v16, even at -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9) were still branch-free, but via a less optimal sequence of instructions.

GCC's code gets even more ridiculous at 32 bytes, because it does a branch after every 8-byte compare, while the clang code is fully branch-free (not that branch-free is always better, but it seems clearly so in this case). Judging by the codegen, there seem to be three deficiencies in GCC: 1) an inability to take advantage of the load-pair instructions to load 16 bytes at a time, 2) an inability to use ccmp to combine comparisons, and 3) using branching rather than cset to fill the output register. Ideally these could all be handled in the general case by the low-level instruction optimizer, but even getting them special-cased for memcmp (and friends) would be an improvement.
[Bug libstdc++/111589] New: Use relaxed atomic increment (but not decrement!) in shared_ptr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111589

Bug ID: 111589
Summary: Use relaxed atomic increment (but not decrement!) in shared_ptr
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

The atomic increment when copying a shared_ptr can be relaxed because it is never actually used as a synchronization operation. The current thread must already have sufficient synchronization to access the memory because it can already deref the pointer. All synchronization is done either via whatever program-provided code makes the shared_ptr object available to the thread, or in the atomic decrement (where the decrements to non-zero are releases that ensure all uses of the object happen before the final decrement to zero, which acquires and destroys the object).

As an argument from authority, libc++ is already using relaxed for increments and acq/rel for decrements:
https://github.com/llvm/llvm-project/blob/c649fd34e928ad01951cbff298c5c44853dd41dd/libcxx/include/__memory/shared_ptr.h#L101-L121

This will have no impact on x86, where all atomic RMWs are effectively sequentially consistent, but it will enable the use of ldadd rather than ldaddal on aarch64, and similar optimizations on other weaker architectures.
[Bug libstdc++/111588] New: Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588

Bug ID: 111588
Summary: Provide opt-out of shared_ptr single-threaded optimization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

Right now there is a fast path for single-threaded programs to avoid the overhead of atomics in shared_ptr, but there is no equivalent for a program that knows it is multi-threaded to remove the check and branch. If __GTHREADS is not defined then no atomic code is emitted.

There are two issues with this: 1) Programs that know they are effectively always multithreaded pay for a runtime branch and .text segment bloat for an optimization that never applies. This may have knock-on effects of making functions that use shared_ptr less likely to be inlined by pushing them slightly over the complexity threshold. 2) It invalidates single-threaded microbenchmarks of code that uses shared_ptr, because the performance of the code may be very different from when it is run in the real multithreaded program.

I understand the value of making a fast mode for single-threaded code, and I can even accept having the runtime branch by default, rather than as an opt-in, when it is unknown whether the program will be run with multiple threads. But an opt-out would be nice to have. If it had to be a gcc-build-time option rather than a #define, that would be acceptable for us since we always use our own build of gcc, but it seems like a worse option for other users.

FWIW, neither llvm libc++ (https://github.com/llvm/llvm-project/blob/0bfaed8c612705cfa8c5382d26d8089a0a26386b/libcxx/include/__memory/shared_ptr.h#L103-L110) nor MS-STL (https://github.com/microsoft/STL/blob/main/stl/inc/memory#L1171-L1173) ever uses runtime branching to detect multithreading.
[Bug c++/89997] Garbled expression in error message with -fconcepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997 --- Comment #6 from Mathias Stearn --- > I think this is probably not concepts-specific, and just another variant of > PR 49152. Perhaps the busted pretty printer is a general problem, but at least in this case I think the fix may be in concepts code. It looks like the error is generated at https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/cp/constraint.cc#L3300 (or the similar call 7 lines lower). Given that gcc already prints the source loc with the invalid expression, I think you can just remove the %qE to improve the diagnostic output. (I don't know the gcc codebase at all, so I could be wrong about this, I just grepped for the string "the required expression")
[Bug c++/89997] Garbled expression in error message with -fconcepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997

--- Comment #3 from Mathias Stearn ---
Please reopen. It still seems to be broken with -std=c++20 as the only flag:
https://godbolt.org/z/bWMq4s6xb (trunk)
https://godbolt.org/z/W3xWjWaGe (12.2)

Output:

<source>: In function 'void test()':
<source>:16:15: error: no matching function for call to 'check<int>()'
   16 |     check<int>(); // mangled error
      |     ~~^~
<source>:12:6: note: candidate: 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
      |      ^
<source>:12:6: note: template argument deduction/substitution failed:
<source>:12:6: note: constraints not satisfied
<source>: In substitution of 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;} [with T = int]':
<source>:16:15: required from here
<source>:12:6: required by the constraints of 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
<source>:12:23: in requirements with 'X x', 'T val' [with T = int]
<source>:12:60: note: the required expression '("hello"->x.X::operator<<() << val)' is invalid
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
      |      ~^~
cc1plus: note: set '-fconcepts-diagnostics-depth=' to at least 2 for more detail
Compiler returned: 1

The last line still says "the required expression '("hello"->x.X::operator<<() << val)' is invalid". It should not be trying to apply -> to a string literal. The following line, with the caret and underline, is very helpful and shows what the problem is. But the "note" line seems actively harmful, since it is showing an expression that would never be valid for any type. It seems like it would be better to remove that line than to attempt to show it if you can't reproduce the correct expression.
[Bug c++/105397] Cannot export module initializer symbols with `-fvisibility=hidden`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105397

Mathias Stearn changed:
           What    |Removed |Added
           CC      |        |redbeard0531 at gmail dot com

--- Comment #1 from Mathias Stearn ---
Perhaps the best option is to default the visibility of the implicit functions to the widest visibility of any function or object in the module purview exposed by the TU. The assumption being that if anything is visible outside the library, then it is expected to be imported from TUs outside the library, and that should Just Work. Conversely, if everything is defined with internal visibility, then it is unlikely that this module was intended to be imported from outside of the library, and so it may be desirable to allow different libs to have their own module with the same name.

Unfortunately, that doesn't give any good indication of what to do for importable units that have an empty module purview (or where everything inside it has TU-local internal linkage). While legal, maybe that isn't a case worth optimizing the Just Works experience for?
[Bug target/105661] New: Comparisons to atomic variables generate less efficient code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105661

Bug ID: 105661
Summary: Comparisons to atomic variables generate less efficient code
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

With normal variables, GCC will generate a nice cmp/js pair when checking whether the high bit is set. When using an atomic, GCC generates a movzx/test/js triple, even though it could use the same codegen as for a non-atomic. https://godbolt.org/z/GorvWfrsh

#include <cstdint>
#include <atomic>

[[gnu::noinline]] void f();

uint8_t plain;
std::atomic<uint8_t> atomic;

void plain_test() {
    if (plain & 0x80)
        f();
}

void atomic_test() {
    if (atomic.load(std::memory_order_relaxed) & 0x80)
        f();
}

With both -O2 and -O3 this generates:

plain_test():
        cmp     BYTE PTR plain[rip], 0
        js      .L4
        ret
.L4:
        jmp     f()
atomic_test():
        movzx   eax, BYTE PTR atomic[rip]
        test    al, al
        js      .L7
        ret
.L7:
        jmp     f()

ARM64 seems to be hit even harder, but I don't know that platform well enough to know if the non-atomic codegen is valid there: https://godbolt.org/z/c3h8Y1dan. It seems likely though, at least for a relaxed load.
[Bug c++/105560] New: Spurious -Wunused-local-typedefs warning on a typedef used on returned type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105560

Bug ID: 105560
Summary: Spurious -Wunused-local-typedefs warning on a typedef used on returned type
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

https://godbolt.org/z/zcY11ezev

#include <type_traits>
#include <utility>

template <typename F>
typename F::Out call(F&& fun) {
    typename F::Out var = fun(1);
    return var;
}

void test() {
    // Fill builder with bson-encoded elements from view.
    auto decorateLambda = [](auto&& lambda) {
        using Lambda = std::remove_reference_t<decltype(lambda)>;
        struct DecoratedLambda : Lambda {
            using Out = bool;
        };
        return DecoratedLambda{std::forward<decltype(lambda)>(lambda)};
    };
    call(decorateLambda([&](auto&& value) {
        (void)(value);
        return true;
    }));
}

With -O2 -std=c++20 -Wall:

<source>: In lambda function:
<source>:15:19: warning: typedef 'using test()::<lambda(auto:1&&)>::DecoratedLambda::Out = bool' locally defined but not used [-Wunused-local-typedefs]
   15 |             using Out = bool;
      |                   ^~~
Compiler returned: 0

However, DecoratedLambda::Out *is* used by call(), both in the signature and in the body. Perhaps the issue is that typedefs in local types that are returned (or are reachable from returned types) shouldn't be considered "local"?
[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

--- Comment #4 from Mathias Stearn ---
(In reply to Richard Biener from comment #2)
> Note there's
>
>   _2 = value_1(D) * x_2;
>
> where _2 might be effectively uninitialized iff x_2 is not zero. When x_2
> is zero then _2 has a well-defined value.

Not according to the C++ standard. http://eel.is/c++draft/basic.indet seems pretty clear that unless x_2 is std::byte (which doesn't support multiplication) or an "unsigned ordinary character type", that is UB. FWIW, I still think I'd expect the warning on "unsigned char x, y; y = x * 0;", but I would definitely expect the warning for int.
[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

--- Comment #3 from Mathias Stearn ---
One slightly confusing aspect is that the wording of the flag implies that the variable may or may not be uninitialized (because in -Wmaybe-uninitialized, "maybe" binds directly to "uninitialized"), while the phrasing of the warning message is about the usage being conditional: "may be used uninitialized". And of course the documentation (at least the man page) uses a different phrasing:

> For an object with automatic or allocated storage duration, if there exists a
> path from the function entry to a use of the object that is initialized, but
> there exist some other paths for which the object is not initialized, the
> compiler emits a warning if it cannot prove the uninitialized paths are not
> executed at run time.

For both the initial example with count and your example with count2, I'd say that the "there exists a path from the function entry to a use of the object that is initialized" bit is clearly not satisfied, so if we assume the documentation is correct, then those cases both lack a "maybe" and the variables are clearly uninitialized. This would also match my intuition for -Wuninitialized, which is that it definitively errors if all paths from declaration to any usage result in the variable being uninitialized.

PS - This test case is a reduced example from an actual bug that luckily was found by Coverity before release: https://jira.mongodb.org/browse/SERVER-66306. I dug in to make a repro because I was expecting that we would have gotten a compiler error on that code before it was even committed. I'm also exploring whether we can stop passing -Wno-maybe-uninitialized, but it looks like we still get false positives in third-party headers, so it doesn't seem likely.
[Bug c++/105534] New: -Wmaybe-uninitialized shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

Bug ID: 105534
Summary: -Wmaybe-uninitialized shouldn't suppress -Wuninitialized warnings
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

The following function emits a -Wuninitialized warning on ++count with -Wall (https://godbolt.org/z/KfaMEETY1):

int test(bool cond) {
    int count;
    ++count;
    return count;
}

Making the increment conditional changes it to a -Wmaybe-uninitialized warning, which is suppressed with -Wno-maybe-uninitialized (https://godbolt.org/z/qarMrqW7E):

int test(bool cond) {
    int count;
    if (cond)
        ++count;
    return count;
}

This makes no sense. count is never initialized on any path through the function, and it is returned on all paths.

We use -Wall with -Wno-maybe-uninitialized on our codebase because we were getting too many false positives with -Wmaybe-uninitialized, in particular from third-party headers that we didn't want to modify. At the time we decided to do that, we didn't realize that we would also be missing out on clearly uninitialized cases like this.
[Bug target/105496] New: Comparison optimizations result in unnecessary cmp instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105496

Bug ID: 105496
Summary: Comparison optimizations result in unnecessary cmp instructions
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: redbeard0531 at gmail dot com
Target Milestone: ---

https://godbolt.org/z/1zdYsaqEj

Consider these equivalent functions:

int test1(int x) {
    if (x <= 10) return 123;
    if (x == 11) return 456;
    return 789;
}

int test2(int x) {
    if (x < 11) return 123;
    if (x == 11) return 456;
    return 789;
}

In test2 it is very clear that you can do a single cmp of x with 11 and then use different flag bits to choose your case. In test1 it is less clear, but because x <= 10 and x < 11 are equivalent, you can always transform one into the other. Clang does this correctly: it transforms test1 into test2 and emits only a single cmp instruction in each. For some reason, not only does GCC miss this optimization, it seems to go the other direction and transform test2 into test1, emitting 2 cmp instructions for both!

test1(int):
        mov     eax, 123
        cmp     edi, 10
        jle     .L1
        cmp     edi, 11
        mov     eax, 456
        mov     edx, 789
        cmovne  eax, edx
.L1:
        ret
test2(int):
        mov     eax, 123
        cmp     edi, 10
        jle     .L6
        cmp     edi, 11
        mov     eax, 456
        mov     edx, 789
        cmovne  eax, edx
.L6:
        ret

Observed with at least -O2 and -O3. I initially observed this for code where each if generated an actual branch rather than a cmov, but when I reduced the example, the cmov was generated.

I'm not sure if this should be a middle-end or target-specific optimization, since ideally it would be smart on all targets that use comparison flags, even if there are some targets that don't. Is there ever a downside to trying to make two adjacent comparisons compare the same number?
[Bug libstdc++/104223] New: GCC unable to inline trivial functions passed to views::filter and transform unless lifted into types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104223 Bug ID: 104223 Summary: GCC unable to inline trivial functions passed to views::filter and transform unless lifted into types Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

I'm not sure if this is an issue with the optimizer or with the way the library code is written, or some combination of the two, but the end result seems unfortunate. with() and without() are logically the same function; the only difference is that with() lifts the function pointers into types using a templated lambda variable, while without() just passes the function names directly to the library. It seems interesting that the optimizer knows that they are "constant enough" to emit direct rather than indirect calls to t() and f(), yet not constant enough to inline those calls.

https://godbolt.org/z/EqWzzh916 Flags: -std=c++2b -O3 Reproduces on at least 11.2 and trunk

#include <ranges>

namespace views = std::views;

void trace() noexcept;
inline int f(int i) { trace(); return i; }
inline bool t(int) { return true; }

// for some reason gcc needs this in order to inline f() and t()
template <auto f> auto typify = [] (int i) { return f(i); };

void with() {
    for (auto&& x : views::single(1) | views::transform(typify<f>) | views::filter(typify<t>)) {}
}

void without() {
    for (auto&& x : views::single(1) | views::transform(f) | views::filter(t)) {}
}

with():
    sub rsp, 8
    call trace()
    add rsp, 8
    jmp trace()
without():
    sub rsp, 8
    mov edi, 1
    call f(int)
    mov edi, eax
    call t(int)
    test al, al
    jne .L10
    add rsp, 8
    ret
.L10:
    mov edi, 1
    add rsp, 8
    jmp f(int)
[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 --- Comment #12 from Mathias Stearn --- (In reply to Jakub Jelinek from comment #10) > So we'd just punt at optimizing that, we don't know if b is read or written > by foo (and, note, it doesn't have to be just the case of explicitly being > passed address of some var, it can get the address through other means). > On the other side, we can't optimize b to b: .long 2, because bar can use > the variable and/or modify it, so by using 2 as static initializer bar would > be miscompiled.

I'm pretty sure that that is explicitly allowed by https://eel.is/c++draft/basic.start.static#3, so it should *not* be considered a miscompilation. The example only shows reading from another static-duration variable's initializer, but I believe writing would also be covered. I took a look at how other compilers handle this, and it is somewhat interesting: https://godbolt.org/z/9YvcbEeax

int foo(int&);
inline int bar() { return 7; }
extern int b;
int z = bar();
int a = foo(b); // comment this out and watch clang change...
int b = bar();
int c = bar();

GCC: always does dynamic init for everything.
MSVC: always does static init for z, b, and c. Always dynamic init for a.
Clang: it seems like if it does any dynamic init in the TU, it doesn't promote any dynamic init to static. So with that code all four variables are dynamically initialized, but if you comment out a, the remaining 3 become static. If you add an unrelated variable that requires dynamic init, those 3 become dynamically initialized again.

I don't understand why clang does what it does. I don't think it is required to do that by the standard, and it clearly seems suboptimal. So I would rather GCC behave like MSVC in this case than like clang. Also note what happens if we provide a definition for foo like `inline int foo(int& x) { return x += 6; }`: https://godbolt.org/z/sWd6chsnP. Now both MSVC and Clang will static initialize z, b, and c to 7 and *static* initialize a to 6.
GCC gets the same result dynamically, but for some reason tries to load b prior to adding 6, even though it has to be 0 (barring a UB write to b from another TU).
[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 --- Comment #3 from Mathias Stearn --- > Why not just make the function constexpr though? That isn't always possible. Sometimes the initializer may call a third-party function that is inline, but not yet marked constexpr (it may need to support older language versions where it couldn't be constexpr). Other times the initializer may call a function that is out of line (so can't be constexpr at all), but defined in the same TU. MSVC and clang handle this somewhat more realistic example nicely, gcc doesn't: https://godbolt.org/z/jYKx8359T The original example using chrono was just something that when reading I thought "any optimizer worth its salt should be able to do this even without explicit constexpr annotation". I was disappointed to learn that gcc couldn't, so I filed a bug in the hope that it can be improved.
[Bug c++/102876] New: GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 Bug ID: 102876 Summary: GCC fails to use constant initialization even when it knows the value to initialize Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

See: https://godbolt.org/z/KnKv9j8b9

#include <chrono>
using namespace std::literals;

/*not constexpr*/ inline std::chrono::sys_days compute_days() { return 1776y/7/4; }

std::chrono::sys_days BURN_IN_TO_BINARY = compute_days();

Clang and MSVC both correctly burn BURN_IN_TO_BINARY into the binary image with the correct value. Even with -O3, gcc zero-initializes it then uses a dynamic initializer to complete the initialization. Both are valid implementation strategies according to https://eel.is/c++draft/basic.start.static#3, however, I think the strategy used by clang and MSVC is clearly superior QoI here.
[Bug ipa/102528] Unused out-of-line functions emitted for trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 --- Comment #8 from Mathias Stearn --- Sorry again about the confusion caused by my typo. I am not able to edit the comment to make it clear that comment #0 should be ignored. If that happens again, would it be better for me to close the ticket and create a new one?
[Bug c++/102528] Unused out-of-line functions emitted for trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 --- Comment #1 from Mathias Stearn --- Sorry, there was a typo in the initial code. I forgot the trivial implementation of await_resume(). (D'oh!) Now I can see that test() is just a ret instruction, but there is still a lot of dead code emitted for the coroutine functions. While it isn't too much for the trivial functions, for a real use case it would be. And it seems like a bug anyway that unused code makes it to the obj file at -O3: https://godbolt.org/z/bTncofrzd

#include <coroutine>

struct simple {
    struct promise_type {
        void return_void() {}
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        simple get_return_object() { return {}; }
    };
    std::true_type await_ready() { return {}; }
    void await_suspend(std::coroutine_handle<>) {}
    void await_resume() {}
};

inline simple test1() { co_return; }
inline simple test2() { co_await test1(); co_return; }
void test() { test2(); }

test1(test1()::_Z5test1v.frame*) [clone .actor]:
    movzx eax, WORD PTR [rdi+18]
    test al, 1
    je .L2
    cmp ax, 5
    ja .L3
    mov edx, 42
    bt rdx, rax
    jnc .L3
.L4:
    cmp BYTE PTR [rdi+17], 0
    jne .L14
.L1:
    ret
.L2:
    cmp ax, 2
    je .L5
    cmp ax, 4
    je .L4
    test ax, ax
    jne .L3
    mov QWORD PTR [rdi+24], rdi
.L5:
    cmp BYTE PTR [rdi+17], 0
    mov BYTE PTR [rdi+32], 1
    mov QWORD PTR [rdi], 0
    je .L1
.L14:
    jmp operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .actor] [clone .cold]:
.L3:
    ud2
test2(test2()::_Z5test2v.frame*) [clone .actor]:
    movzx eax, WORD PTR [rdi+18]
    test al, 1
    je .L16
    cmp ax, 7
    ja .L17
    mov edx, 170
    bt rdx, rax
    jnc .L17
.L18:
    cmp BYTE PTR [rdi+17], 0
    jne .L33
.L15:
    ret
.L16:
    cmp ax, 4
    je .L19
    ja .L20
    test ax, ax
    jne .L34
    mov QWORD PTR [rdi+24], rdi
.L22:
    mov BYTE PTR [rdi+32], 1
.L19:
    cmp BYTE PTR [rdi+17], 0
    mov QWORD PTR [rdi], 0
    je .L15
.L33:
    jmp operator delete(void*)
.L34:
    cmp ax, 2
    je .L22
    jmp .L17
.L20:
    cmp ax, 6
    je .L18
    jmp .L17
test2(test2()::_Z5test2v.frame*) [clone .actor] [clone .cold]:
.L17:
    ud2
test1(test1()::_Z5test1v.frame*) [clone .destroy]:
    movzx eax, WORD PTR [rdi+18]
    or eax, 1
    mov WORD PTR [rdi+18], ax
    cmp ax, 5
    ja .L36
    mov edx, 42
    bt rdx, rax
    jnc .L36
    cmp BYTE PTR [rdi+17], 0
    jne .L39
    ret
.L39:
    jmp operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .destroy] [clone .cold]:
.L36:
    ud2
test2(test2()::_Z5test2v.frame*) [clone .destroy]:
    movzx eax, WORD PTR [rdi+18]
    or eax, 1
    mov WORD PTR [rdi+18], ax
    cmp ax, 7
    ja .L41
    mov edx, 170
    bt rdx, rax
    jnc .L41
    cmp BYTE PTR [rdi+17], 0
    jne .L44
    ret
.L44:
    jmp operator delete(void*)
test2(test2()::_Z5test2v.frame*) [clone .destroy] [clone .cold]:
.L41:
    ud2
test():
    ret
[Bug c++/102528] New: Unable to inline even trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 Bug ID: 102528 Summary: Unable to inline even trivial coroutines Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/aoab9W4xG

This should all compile away, and test() should just be a single ret instruction. That is not what happens now, even with -O3.

#include <coroutine>

struct simple {
    struct promise_type {
        void return_void() {}
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        simple get_return_object() { return {}; }
    };
    std::true_type await_ready() { return {}; }
    void await_suspend(std::coroutine_handle<>) {}
    void await_resume();
};

inline simple test1() { co_return; }
inline simple test2() { co_await test1(); co_return; }
void test() { test2(); }
[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493 --- Comment #3 from Mathias Stearn --- When reading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82611, I noticed that C++17 actually requires the warning on [=, this] from a conforming implementation: https://timsong-cpp.github.io/cppwp/n4659/expr.prim.lambda.capture#2. However, given that this requirement was lifted in C++20, would it be possible to only warn about that in -std=c++17 (or lower) with -pedantic? It seems counterproductive to warn about it without -pedantic.
[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #2 from Mathias Stearn --- This is making it harder for us to transition to -std=c++20, since we can't have the same code be warning-clean in both versions. I really don't think the warning for [=, this] is helpful given that it is the pattern that is now recommended.
[Bug web/102365] New: Function attribute docs should have an anchor or id on each attribute.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102365 Bug ID: 102365 Summary: Function attribute docs should have an anchor or id on each attribute. Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: web Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html This would make it easier to link to the docs for a specific attribute. There is currently an anchor on the tag (eg https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bartificial_007d-function-attribute-2500 for artificial), but that will scroll down so that the name of the attribute isn't shown. Ideally the anchor should be on the tag instead. A nice enhancement would be to make the attribute name be a self-link to make it easy to get the link without needing to poke in the browser's inspect facility.
[Bug c++/102363] New: source_location in await_transform has function_name referring to internal coroutine funclet rather than source-level function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102363 Bug ID: 102363 Summary: source_location in await_transform has function_name referring to internal coroutine funclet rather than source-level function Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

When a promise_type::await_transform() has a defaulted argument of source_location::current(), it refers to the location of the co_await that triggered the await_transform. This is really useful for building runtime diagnostics and tracing. The file/line/column fields all seem to be filled out correctly, however, the function_name field refers to an internal funclet that the coroutine function is split into. I think it would be more useful if it had the original coroutine function name since that is what users will be expecting.

Example: https://godbolt.org/z/sbMsPG8Eq

Output:
good: /app/example.cpp:20:43: example func()
bad: /app/example.cpp:25:14: void func(func()::_Z4funcv.frame*)

Source (compiled with just -std=c++20):

#include <coroutine>
#include <cstdio>
#include <source_location>

struct example {
    struct promise_type {
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        example get_return_object() { return {}; }
        void return_void() {}
        auto await_transform(const char* dummy,
                             std::source_location l = std::source_location::current()) {
            printf("bad: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(), l.function_name());
            return std::suspend_never();
        }
    };
};

example func() {
    auto l = std::source_location::current();
    printf("good: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(), l.function_name());
    // bad location points here:
    //          v
    co_await "pretend this is an awaitable";
}

int main() { func(); }
[Bug c++/100876] New: -Wmismatched-new-delete should either look through or ignore placement new
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100876 Bug ID: 100876 Summary: -Wmismatched-new-delete should either look through or ignore placement new Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/KTTMrEGns

Example code:

free(new (malloc(4)) int());   // Warns but shouldn't
delete new (malloc(4)) int();  // Doesn't warn but should

output:

:5:9: warning: 'void free(void*)' called on pointer returned from a mismatched allocation function [-Wmismatched-new-delete]
    5 | free(new (malloc(4)) int()); // Warns but shouldn't
      |      ^~~
:5:30: note: returned from 'void* operator new(std::size_t, void*)'
    5 | free(new (malloc(4)) int()); // Warns but shouldn't
      |                              ^

While it would be nice to have a warning on the second line, not warning on the first seems more important, and hopefully the fix is backportable. Here is some Real World Code exhibiting this pattern that g++ currently warns about when compiling: https://github.com/facebook/hermes/blob/dfef1abd6d20b196e24c591e225a7003e6337a94/unittests/VMRuntime/StringPrimitiveTest.cpp#L221-L235. There is also an example using calloc() lower in that file.
[Bug libstdc++/99979] New: condition_variable_any has wrong behavior if Lock::lock() throws
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99979 Bug ID: 99979 Summary: condition_variable_any has wrong behavior if Lock::lock() throws Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://eel.is/c++draft/thread.condvarany.wait says "Postconditions: lock is locked by the calling thread. […] Remarks: If the function fails to meet the postcondition, terminate() is invoked […] This can happen if the re-locking of the mutex throws an exception." This wording was changed in C++14 for https://cplusplus.github.io/LWG/issue2135, but given that it is borderline impossible to use a condition_variable[_any] correctly if it doesn't guarantee relocking, it seems best to do the same in C++11 mode as well.

Unfortunately, std::condition_variable_any::__Unlock::~__Unlock() either ignores[1] or propagates any exception thrown by lock(), which means that the postcondition of wait() may not be fulfilled. I think the correct behavior would be to remove the noexcept(false), allowing the destructor's default noexcept to apply, and just unconditionally call _M_lock.lock() without any try/catch.

[1] It ignores if std::uncaught_exception() returns true. I'm not 100% sure why that dispatch is there. It seems like it may be trying to detect if _M_cond.wait() threw, but a) that is noexcept and b) that doesn't work correctly if some thread waits on a condvar inside a destructor that is triggered during exception unwinding (this is why std::uncaught_exceptions() was introduced).
[Bug rtl-optimization/99470] Convert fixed index addition to array address offset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470 Mathias Stearn changed: What|Removed |Added Status|RESOLVED|UNCONFIRMED Resolution|INVALID |--- --- Comment #4 from Mathias Stearn --- Yes, but I believe any case where they would access different addresses would be UB overflow in f(), making it valid to turn f() into g(), especially if you used an internal lowering which sign extended index to pointer width and had defined wrapping semantics. I'll note that clang already generates identical code for f() and g() https://gcc.godbolt.org/z/j897sh, although I think gcc has better codegen at least for g(). Also, my example was perhaps oversimplified. My indexes were actually int8_t (which is why I'm indexing into a 256-element array), so due to int promotion, overflow is actually impossible. However, with int8_t arguments, gcc generates even worse code for f(), doing the sign-extension twice for some reason (8 -> 32 -> 64): https://gcc.godbolt.org/z/5r9h89 (I hope it isn't a faux pas to reopen the ticket, but I think I've provided enough new information that this warrants another look)
[Bug c++/99470] New: Convert fixed index addition to array address offset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470 Bug ID: 99470 Summary: Convert fixed index addition to array address offset Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

These two functions do the same thing but f() is the cleaner source code (especially when arr is a std::array) while g() generates better code: https://gcc.godbolt.org/z/vTT399

#include <cstdint>

inline int8_t arr[256];

bool f(int a, int b) { return arr[a+128] == arr[b+128]; }
bool g(int a, int b) { return (arr+128)[a] == (arr+128)[b]; }

f(int, int):
    sub edi, -128
    sub esi, -128
    lea rax, arr[rip]
    movsx rdi, edi
    movsx rsi, esi
    movzx edx, BYTE PTR [rax+rsi]
    cmp BYTE PTR [rax+rdi], dl
    sete al
    ret
g(int, int):
    lea rax, arr[rip+128]
    movsx rdi, edi
    movsx rsi, esi
    movzx edx, BYTE PTR [rax+rsi]
    cmp BYTE PTR [rax+rdi], dl
    sete al
    ret

In addition to only doing the +128 once, it also ends up being completely free in g() because the assembler (or linker?) folds the addition into the address calculation by adjusting the offset of the rip-relative address. In the godbolt link, you can see that when compiled to binary, the LEA instruction uses the same form in both f() and g(), so the addition really is free in g().
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 --- Comment #6 from Mathias Stearn --- > The question is how common in the wild it is and if it is worth the work. I'm not sure how common it is, but this is my use case. The code in the bug report is a slightly simplified example from some Real World Code I am working on: https://source.wiredtiger.com/develop/struct_w_t___c_u_r_s_o_r.html#af19f6f9d9c7fc248ab38879032620b2f. Essentially WT_CURSOR is a structure of function pointers that all take a WT_CURSOR pointer as the first argument, similar to C++ virtual functions. The get_key() and get_value() "methods" both take (WT_CURSOR*, ...) arguments, and unpack the arguments based on the format string that was set up when you opened the cursor. The expectation is that they will be called many times for each cursor object. Since we know the format string when creating the cursor I was experimenting with creating specialized functions for common formats rather than dispatching through the generic format-inspecting logic every time. I noticed that I was able to get even more performance by declaring the functions as taking the arguments they actually take rather than dealing with va_args, then casting the function pointers to the generic (WT_CURSOR, ...) type when assigning into the method slot. I assume this is UB in C (I don't know the C standard nearly as well as C++) but all ABIs I know of are designed to make this kind of thing work. I'd rather not have to do that kind of questionable shenanigans in order to get the same performance.
[Bug libstdc++/95079] unordered_map::insert_or_assign and try_emplace should only hash and mod once unless there is a rehash.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95079 --- Comment #5 from Mathias Stearn --- @François Dumont: Sorry I didn't see your question earlier. The reason that unordered_map perf hurts on 64-bit platforms is because it is designed to do a size_t modulus-by-prime on every lookup, and on most platforms that is *very* expensive (up to 100 cycles for 64 bits vs 20ish for 32 bits). Some very modern CPUs have made improvements here, but it is still much more expensive than just using power-of-2 buckets and masking, even if you need to hash the hash if you don't trust the low order bits to have enough entropy. Unfortunately, fixing this is a pretty big ABI break, so it isn't going to change any time soon.
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 --- Comment #5 from Mathias Stearn --- I filed a related bug with clang right after this one if anyone wants to follow along: https://bugs.llvm.org/show_bug.cgi?id=49395. Just because clang does worse doesn't mean gcc shouldn't do better ;)
[Bug c/99339] New: Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 Bug ID: 99339 Summary: Poor codegen with simple varargs Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

These two functions should generate the same code, but the varargs version is much worse, and worse still when you enable -fstack-protector-strong. https://godbolt.org/z/noEYoh

#include <stdarg.h>

int test_va(int x, ...) {
    int i;
    va_list va;
    va_start(va, x);
    i = va_arg(va, int);
    va_end(va);
    return i + x;
}

int test_args(int x, int i) { return i + x; }

# explicit args with or without stack protection
test_args:
    lea eax, [rsi+rdi]
    ret

# without stack protection (why aren't dead stores to the stack being eliminated?)
test_va:
    lea rax, [rsp+8]
    mov QWORD PTR [rsp-40], rsi
    mov QWORD PTR [rsp-64], rax
    lea rax, [rsp-48]
    mov QWORD PTR [rsp-56], rax
    mov eax, DWORD PTR [rsp-40]
    mov DWORD PTR [rsp-72], 8
    add eax, edi
    ret

# with stack protection (yikes!)
test_va:
    sub rsp, 88
    mov QWORD PTR [rsp+40], rsi
    mov rax, QWORD PTR fs:40
    mov QWORD PTR [rsp+24], rax
    xor eax, eax
    lea rax, [rsp+96]
    mov DWORD PTR [rsp], 8
    mov QWORD PTR [rsp+8], rax
    lea rax, [rsp+32]
    mov QWORD PTR [rsp+16], rax
    mov eax, DWORD PTR [rsp+40]
    add eax, edi
    mov rdx, QWORD PTR [rsp+24]
    sub rdx, QWORD PTR fs:40
    jne .L7
    add rsp, 88
    ret
.L7:
    call __stack_chk_fail
[Bug c++/99047] New: ICE on simple task coroutine example
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99047 Bug ID: 99047 Summary: ICE on simple task coroutine example Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/ecxEbb

Code compiles on MSVC and clang, and at least on clang produces the expected output.

#include <coroutine>
#include <optional>

template <typename T>
struct [[nodiscard]] task {
    struct promise_type {
        std::suspend_always initial_suspend() { return {}; }
        auto final_suspend() noexcept {
            struct awaiter {
                std::false_type await_ready() noexcept { return {}; }
                std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept { return next; }
                void await_resume() noexcept {}
                std::coroutine_handle<> next;
            };
            return awaiter{next};
        }
        void unhandled_exception() noexcept { std::terminate(); }
        auto get_return_object() { return task(this); }
        auto coro() { return std::coroutine_handle<promise_type>::from_promise(*this); }
        void return_value(T val) { result.emplace(std::move(val)); }
        std::coroutine_handle<> next;
        std::optional<T> result;
    };

    task(task&& source) : p(std::exchange(source.p, nullptr)) {}
    explicit task(promise_type* p) : p(p) {}
    ~task() { if (p) p->coro().destroy(); }

    bool await_ready() noexcept { return p->coro().done(); }
    std::coroutine_handle<> await_suspend(std::coroutine_handle<> next) noexcept {
        p->next = next;
        return p->coro();
    }
    const T& await_resume() const& noexcept { return *p->result; }

    promise_type* p;
};

task<int> five() { co_return 5; }
task<int> six() { co_return (co_await five()) + 1; }

int main() {
    auto task = six();
    task.p->next = std::noop_coroutine();
    task.p->coro().resume();
    return *task.p->result;
}

: In function 'void _Z4fivev.actor(five()::_Z4fivev.frame*)':
:92:11: internal compiler error: in fold_convert_loc, at fold-const.c:2430
   92 | task<int> five() {
      |           ^~~~
0x1ce6f09 internal_error(char const*, ...)
???:0
0x6b6f43 fancy_abort(char const*, int, char const*)
???:0
0xc92cd4 fold_convert_loc(unsigned int, tree_node*, tree_node*)
???:0
0xd126ae gimple_boolify(tree_node*)
???:0
0xd1b720 gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int)
???:0
0xd1bb4a gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int)
[snipping many more frames]
[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 --- Comment #3 from Mathias Stearn --- Thanks for the quick fix! Any chance of a backport of this and https://github.com/gcc-mirror/gcc/commit/f1b6e46c417224887c2f21baa6d4c538a25fe9fb#diff-ed08df78eba81189642b4e8d670a0adb4b377db6846aecb090b4dce52a9251fa to the v10 branch? It will improve the experience of anyone using clang-based editor tooling when using libstdc++ rather than libc++.
[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 --- Comment #1 from Mathias Stearn --- clang bug for reference https://bugs.llvm.org/show_bug.cgi?id=49109
[Bug libstdc++/99021] New: coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 Bug ID: 99021 Summary: coroutine_handle<_Promise>::from_address() should be noexcept Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- While it isn't required by the standard, it seems odd that coroutine_handle::from_address() is noexcept while the non-specialized version isn't when they are the same code. This would also help with clang/libstdc++ compatibility because they currently check for noexcept of from_address() as part of validating https://eel.is/c++draft/dcl.fct.def.coroutine#15 (I'll be filing a bug with them about that shortly since that is leaking an internal compiler implementation detail).
[Bug c++/98639] GCC accepts cast from Base to Derived in C++20 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98639 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #4 from Mathias Stearn --- This really seems like it *is* supposed to be allowed based on simply following the normal language rules for aggregates. It would probably require a special case rule to prevent it from working. Is there something that leads you to think that such a special case rule was forgotten, rather than the natural outcome being expected?
[Bug libstdc++/95079] New: unordered_map::insert_or_assign and try_emplace should only hash and mod once unless there is a rehash.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95079 Bug ID: 95079 Summary: unordered_map::insert_or_assign and try_emplace should only hash and mod once unless there is a rehash. Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

Currently insert_or_assign() and try_emplace() call find(key) and fall back to emplace(...) if that fails to find the key. The computed hash (and, more importantly in general, the modded bucket index) computed by find() is thrown away and recomputed by emplace(). Instead they should compute the hash once, and unless there is a rehash, also only do the modulus once. This optimization is already performed for operator[].

https://godbolt.org/z/cw82RC shows that the hasher is invoked once for operator[] and twice for insert_or_assign(). http://quick-bench.com/ge8Suq7PcdRKm6IBQbjvwuXhW6Y shows that there is a significant performance difference (20% in this test).

(I know std::unordered_map is always going to be less than fast on 64-bit platforms, but it doesn't need to be slower than necessary...)
[Bug libstdc++/94854] New: Comment in basic_string.tcc incorrectly says std::string doesn't have explicit instantiation in C++17
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94854 Bug ID: 94854 Summary: Comment in basic_string.tcc incorrectly says std::string doesn't have explicit instantiation in C++17 Product: gcc Version: 9.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://github.com/gcc-mirror/gcc/blob/0c8217b16f307c3eedce8f22354714938613f701/libstdc%2B%2B-v3/include/bits/basic_string.tcc#L1612-L1617 Looks like the code changed in this commit https://github.com/gcc-mirror/gcc/commit/1a289fa36294627c252492e4c18d7877a7c80dc1#diff-56639139bdefbe09b8f41c465ebf1ab5, but the comment wasn't adjusted to match. While I recognize this isn't an issue for 99.9% of users, hopefully if this is fixed it will save someone else that minute or two of staring at the code trying to reconcile it with the comment.
[Bug libstdc++/90295] Please define ~exception_ptr inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90295 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #2 from Mathias Stearn ---

This will become more of a problem with C++ coroutines where in many cases the compiler can prove that no exception will be thrown, but the generator or task type will still have overhead from an exception_ptr member that the compiler can't eliminate at the moment. A handful of additional methods should probably either be completely inline or have a nullptr fast-path: https://godbolt.org/z/zP6DSJ

default ctor
copy ctor
copy assign
swap
operator==
operator!=

Example showing significant codegen at -O3 which should all optimize away:

#include <exception>

bool test() {
    std::exception_ptr p1;
    std::exception_ptr p2(p1);
    p1 = p2;
    swap(p1, p2);
    return p1 == nullptr && nullptr == p2 && p1 == p2;
}
[Bug c++/91759] New: g++ accepts ill-formed deduction guides in wrong scope
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91759 Bug ID: 91759 Summary: g++ accepts ill-formed deduction guides in wrong scope Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---
namespace N { template <typename T> struct X { X(int); }; }
using N::X;
X(int) -> X<int>;
This is supposed to not compile due to http://eel.is/c++draft/temp.deduct.guide#3.sentence-4: A deduction-guide shall be declared in the same scope as the corresponding class template and, for a member class template, with the same access. clang, icc, and msvc all correctly consider this to be ill-formed: https://godbolt.org/z/UH3Y8W.
[Bug c++/90287] New: [concepts] bogus error on overload failure inside requires-expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90287 Bug ID: 90287 Summary: [concepts] bogus error on overload failure inside requires-expression Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/30Cf6s
#include <string>
template <typename T, typename U>
constexpr inline bool isAddable = requires(const T& lhs, const U& rhs) { lhs + rhs; };
auto x = isAddable<int, std::string>;
: In instantiation of 'constexpr const bool isAddable<int, std::__cxx11::basic_string<char> >':
:13:10: required from here
:10:10: error: no match for 'operator+' (operand types are 'const int' and 'const std::__cxx11::basic_string<char>')
10 | lhs + rhs ;
| ^
[snip rest of wall-o-errors]
If I make isAddable a concept, it correctly evaluates to false, but I would expect requires-expressions to also work when assigned to a constexpr bool.
[Bug c++/90033] New: [concepts] ICE segfault evaluating a requires clause that transitively depends on itself
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90033 Bug ID: 90033 Summary: [concepts] ICE segfault evaluating a requires clause that transitively depends on itself Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/SEfFol This is a creduce'd example that tripped a segfault in our Real World Code implementation of unique_function. The RWC version includes a requirement that X != Y so this can never be a copy or move constructor, but that was removed in the reduction. FWIW, the clang concepts fork compiles this successfully.
template <bool B> struct bool_constant { static constexpr bool value = B; };
template <class T, class... Args>
struct is_constructible : bool_constant<__is_constructible(T, Args...)> {};
template <class T> T&& move(T&);
struct X {
    template <typename OtherFunc>
    requires(is_constructible<X, OtherFunc>::value)
    X(OtherFunc&&);
    X() = default;
};
X source;
X dest = move(source);
- : In substitution of 'template<class OtherFunc> requires is_constructible<X, OtherFunc>::value X::X(OtherFunc&&) [with OtherFunc = X]':
:16:21: required from here
:16:21: internal compiler error: Segmentation fault
16 | X dest = move(source);
| ^
Please submit a full bug report, with preprocessed source if appropriate. See <https://gcc.gnu.org/bugs/> for instructions. I know concepts are still experimental, but if the fix turns out to be simple, we'd appreciate a backport to gcc8.
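The "requirement that X != Y" guard the report mentions can be sketched with standard SFINAE machinery. This is an assumption about the intended pattern, not the report's actual code; the `Fn` type and `via_template` flag are illustrative.

```cpp
#include <type_traits>
#include <utility>

// Guarded forwarding constructor: excluded when the argument type is Fn
// itself, so it can never hijack (or recurse through) the copy/move ctors.
struct Fn {
    bool via_template = false;
    Fn() = default;
    Fn(Fn&&) = default;
    template <typename OtherFunc,
              std::enable_if_t<
                  !std::is_same<std::decay_t<OtherFunc>, Fn>::value, int> = 0>
    Fn(OtherFunc&&) : via_template(true) {}
};

// Moving an Fn picks Fn(Fn&&), not the template.
bool move_uses_move_ctor() {
    Fn a;
    Fn b = std::move(a);
    return !b.via_template;
}

// Constructing from anything else picks the forwarding template.
bool int_uses_template() {
    Fn c(42);
    return c.via_template;
}
```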
[Bug c++/90031] New: Bogus parse error trying to explicitly specialize a template variable inside class scope
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90031 Bug ID: 90031 Summary: Bogus parse error trying to explicitly specialize a template variable inside class scope Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/y5GQZd
struct Struct {
    template <typename T> constexpr static bool use_cond = false;
    template <> constexpr static bool use_cond<int> = true;
};
:5:27: error: explicit template argument list not allowed
5 | constexpr static bool use_cond<int> = true;
| ^
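For comparison, the equivalent specialization at namespace scope is accepted; the error above only fires at class scope. A minimal sketch (the names mirror the report's example but the namespace-scope form is my assumption about the intended workaround):

```cpp
// Variable template with an explicit specialization at namespace scope,
// which GCC parses without complaint.
template <typename T> constexpr bool use_cond = false;
template <> constexpr bool use_cond<int> = true;

bool cond_for_int() { return use_cond<int>; }
bool cond_for_double() { return use_cond<double>; }
```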
[Bug c++/90019] New: [8 regression] Bogus ambiguous overload error for NTTP pack of disjoint enable_ifs unless there is an unsupplied default argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90019 Bug ID: 90019 Summary: [8 regression] Bogus ambiguous overload error for NTTP pack of disjoint enable_ifs unless there is an unsupplied default argument Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- This code compiles fine with 7 and trunk but fails with gcc8: https://godbolt.org/z/v1HN8B
#include <type_traits>
// gcc8 thinks these are ambiguous for <0>
template <int I, std::enable_if_t<I == 0, int>...> void foo(){}
template <int I, std::enable_if_t<I != 0, int>...> void foo(){}
// but somehow these aren't for <0>, but are for <0,0> !?
template <int I, int = 0, std::enable_if_t<I == 0, int>...> void bar(){}
template <int I, int = 0, std::enable_if_t<I != 0, int>...> void bar(){}
void test() {
    bar<0>(); // works
    bar<0,0>(); // boom
    foo<0>(); // boom
}
[Bug c++/89997] New: Garbled expression in error message with -fconcepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997 Bug ID: 89997 Summary: Garbled expression in error message with -fconcepts Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- Usually -fconcepts delivers excellent error messages, but this one is pretty bad. It looks like this goes back to 6.2, when it first started to show the expression. https://godbolt.org/z/m9DlOZ
struct Y;
struct X { Y operator<< (const char*); };
struct Y { X operator<< (void*); };
template <typename T>
void check() requires requires (X x, T val) { x << "hello" << val; } {}
void test() {
    check<void*>(); // no error
    check<int>(); // mangled error
}
- : In function 'void test()':
:16:16: error: cannot call function 'void check() requires (> [with T = int]'
16 | check<int>(); // mangled error
| ^
:12:6: note: constraints not satisfied
12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
| ^
:12:6: note: with 'X x'
:12:6: note: with 'int val'
:12:6: note: the required expression '("hello"->x.X::operator<<() << val)' would be ill-formed
What is that expression? How did it end up applying -> to a string literal!?
[Bug c++/89781] Misleading error messages when initializing a static data member in-class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89781 --- Comment #3 from Mathias Stearn --- Yeah, my point wasn't that this code should be accepted, it was that the error messages should be improved. Ideally there would even be fixits suggesting adding constexpr where it would be valid, otherwise suggesting inline. Sorry if that wasn't clear.
[Bug c++/89781] New: Misleading error messages when initializing a static data member in-class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89781 Bug ID: 89781 Summary: Misleading error messages when initializing a static data member in-class Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- The error messages make sense in a pre-c++17 world before inline variables existed. Now they are lies! struct X{}; struct Y { static X x = {}; }; :2:21: error: 'constexpr' needed for in-class initialization of static data member 'X Y::x' of non-integral type [-fpermissive] struct Y { static X x = {}; }; ^ The error is even worse when X can't be a constexpr variable: struct X{ X(); }; struct Y { static X x = {}; }; :2:21: error: in-class initialization of static data member 'X Y::x' of non-literal type struct Y { static X x = {}; }; ^ :2:26: error: temporary of non-literal type 'X' in a constant expression struct Y { static X x = {}; }; ^ :1:8: note: 'X' is not literal because: struct X{ X(); }; ^ :1:8: note: 'X' is not an aggregate, does not have a trivial default constructor, and has no 'constexpr' constructor that is not a copy or move constructor This compiles just fine: struct X{ X(); }; struct Y { static inline X x = {}; }; All examples were passing -std=c++17.
[Bug c++/89780] New: -Wpessimizing-move is too agressive with templates and recommends pessimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89780 Bug ID: 89780 Summary: -Wpessimizing-move is too aggressive with templates and recommends pessimization Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/JA-0Gq
#include <utility>
struct Dest {
    Dest() = default;
    Dest(Dest&&);
    Dest(const Dest&);
};
struct Source : Dest {};
template <class T> Dest withMove() { T x; return std::move(x); }
template <class T> Dest noMove() { T x; return x; }
template Dest withMove<Dest>();
template Dest withMove<Source>();
template Dest noMove<Dest>();
template Dest noMove<Source>();
> g++ -O3 -Wall -std=c++17
: In instantiation of 'Dest withMove() [with T = Dest]':
:24:30: required from here
:13:23: warning: moving a local object in a return statement prevents copy elision [-Wpessimizing-move]
13 | return std::move(x);
| ^
:13:23: note: remove 'std::move' call
Basically, gcc9 recommends changing withMove, where both Source and Dest are moved from, into noMove, where Dest is copy-elided but Source is copied. While this is a minor optimization for the Dest instantiation, it is a potentially significant pessimization for the Source one. Amusingly, this code trips a warning in clang that recommends doing the opposite, adding the std::move to turn noMove into withMove: https://godbolt.org/z/WONlMN
:20:12: warning: local variable 'x' will be copied despite being returned by name [-Wreturn-std-move]
return x;
^
:28:15: note: in instantiation of function template specialization 'noMove<Source>' requested here
template Dest noMove<Source>();
^
:20:12: note: call 'std::move' explicitly to avoid copying
return x;
^
std::move(x)
There is just no winning!
[Bug middle-end/89739] New: pessimizing vectorization at -O3 to load two u64 objects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89739 Bug ID: 89739 Summary: pessimizing vectorization at -O3 to load two u64 objects Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/8vIGZ3
using u64 = unsigned long long;
struct u128 {u64 a, b;};
inline u64 load8(void* ptr) {
    u64 out;
    __builtin_memcpy(&out, ptr, 8);
    return out;
}
u128 load(char* basep, u64 n) {
    return {load8(basep), load8(basep+n-8)};
}
At -O2 this emits ideal asm:
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rdi-8+rsi]
ret
At -O3 it is comical:
movq xmm0, QWORD PTR [rdi]
movhps xmm0, QWORD PTR [rdi-8+rsi]
movaps XMMWORD PTR [rsp-24], xmm0
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]
ret
This seems to have been introduced in gcc7.
[Bug c++/89706] New: -Wstringop-truncation strncpy message is confusing and has psuedo-false-positives
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89706 Bug ID: 89706 Summary: -Wstringop-truncation strncpy message is confusing and has psuedo-false-positives Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/BAeXQB "strncpy output truncated copying between 0 and 8 bytes from a string of length 8" doesn't make it obvious what the problem with the code is or how to solve it. Once I decoded it[1], I realized it was warning about something the code was already set up to handle correctly. Specifically, the last line of func re-establishes the nul byte termination regardless of how many bytes are copied. (Yes, I know that this code is a dumb use of strncpy and we should probably just be using memcpy, but that isn't what this warning is warning about.) [1] I was only able to decode it by tweaking the code until I got the "output truncated before terminating nul copying 8 bytes from a string of the same length" form of the warning. This happens if you comment out the if-block in this code.
#include <cstring>
extern size_t len;
extern char *buf;
inline void func(const char* s) {
    size_t sz = strlen(s);
    if (sz > len - 1)
        sz = len - 1;
    strncpy(buf, s, sz);
    buf[sz] = '\0';
}
void test() {
    const char* p = "Progress";
    func(p);
}
> g++ -Wstringop-truncation -O2
In function 'void func(const char*)', inlined from 'void test()' at :16:9:
:10:12: warning: 'char* strncpy(char*, const char*, size_t)' output truncated copying between 0 and 8 bytes from a string of length 8 [-Wstringop-truncation]
10 | strncpy(buf, s, sz);
| ~~~^~~~
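The pattern the report defends, truncate the copy to the destination capacity and unconditionally re-establish nul termination, can be sketched as a self-contained function. `bounded_copy` and `copies_to` are illustrative names, not from the report.

```cpp
#include <cstddef>
#include <cstring>

// Copy at most cap-1 bytes of s into buf, then terminate, so the result
// is always a valid C string even when the source is truncated.
void bounded_copy(char* buf, std::size_t cap, const char* s) {
    std::size_t sz = std::strlen(s);
    if (sz > cap - 1)
        sz = cap - 1;
    std::strncpy(buf, s, sz);  // may copy no nul byte...
    buf[sz] = '\0';            // ...which this line always fixes up
}

// Hypothetical helper for checking results.
bool copies_to(const char* s, std::size_t cap, const char* expect) {
    char buf[64];
    bounded_copy(buf, cap, s);
    return std::strcmp(buf, expect) == 0;
}
```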
[Bug c++/89688] New: -Wstringop-overflow confused by 2D array of char
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89688 Bug ID: 89688 Summary: -Wstringop-overflow confused by 2D array of char Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- Also, for some reason it repeats the warning 3 times. https://godbolt.org/z/SStUsl // Fine extern const char a[2] = {'1'}; auto z = __builtin_strlen(a); // Warns extern const char aa[1][2] = {{'2'}}; auto zz = __builtin_strlen(aa[0]); > ~/opensource/gcc/prefix/bin/g++ -Wall str.cpp -fsyntax-only str.cpp:7:27: warning: ‘strlen’ argument missing terminating nul [-Wstringop-overflow=] 7 | auto zz = __builtin_strlen(aa[0]); | ^~~ str.cpp:6:19: note: referenced argument declared here 6 | extern const char aa[1][2] = {{'2'}}; | ^~ str.cpp:7:27: warning: ‘strlen’ argument missing terminating nul [-Wstringop-overflow=] 7 | auto zz = __builtin_strlen(aa[0]); | ^~~ str.cpp:6:19: note: referenced argument declared here 6 | extern const char aa[1][2] = {{'2'}}; | ^~ str.cpp:7:32: warning: ‘strlen’ argument missing terminating nul [-Wstringop-overflow=] 7 | auto zz = __builtin_strlen(aa[0]); |^ str.cpp:6:19: note: referenced argument declared here 6 | extern const char aa[1][2] = {{'2'}}; | ^~ Reduced example from real code at: https://github.com/boostorg/date_time/blob/b0437e2999a65668dc4178dbb817a89a382ece94/include/boost/date_time/special_values_formatter.hpp#L89-L92 + https://github.com/boostorg/date_time/blob/b0437e2999a65668dc4178dbb817a89a382ece94/include/boost/date_time/special_values_formatter.hpp#L43-L45
[Bug c++/89682] New: g++9 incorrectly disallows using private static method as default arg to ctor of template type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89682 Bug ID: 89682 Summary: g++9 incorrectly disallows using private static method as default arg to ctor of template type Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- New in GCC9. No other compiler errors on this: https://godbolt.org/z/480KfZ
template <typename T> class C {
    class TagType {};
public:
    C(int, TagType = makeTag());
private:
    static TagType makeTag();
};
void test() { C<int>(1); }
> g++ -fsyntax-only ./creduce_repro/test.ii
./creduce_repro/test.ii: In function ‘void test()’:
./creduce_repro/test.ii:5:29: error: ‘static C<T>::TagType C<T>::makeTag() [with T = int]’ is private within this context
5 | C(int, TagType = makeTag());
| ~~~^~
./creduce_repro/test.ii:7:20: note: declared private here
7 | static TagType makeTag();
| ^~~
[Bug c++/89640] [9 Regression] g++ chokes on lambda with __attribute__
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89640 --- Comment #4 from Mathias Stearn --- @Jakub, this code has neither mutable nor noexcept, so the "wrong place in the grammar" issue doesn't apply. That part of the fix seems correct and useful. While it seems correct to fix what the C++11-style [[attribute]] appertains to to match the standard, it would be unfortunate to do the same to __attribute__(()) style attributes, which are already outside of the standard. Until the standard provides some way to have an attribute that appertains to the lambda function, this part of the "fix" is breaking useful functionality that has existed since GCC 5 without offering any replacement.
[Bug c++/89640] [9 Regression] g++ chokes on lambda with __attribute__
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89640 --- Comment #2 from Mathias Stearn --- Unfortunately the C++ attribute syntax applies to the lambda type rather than the function, so the warning is correct. The old-style __attribute__ syntax seems to be the only way to annotate the lambda function, which is why we use it here. We use something like this in a macro around code that builds error messages on our error paths, which is why it needs to be on a lambda. Moving that stuff out to the .text.cold section notably shrank our primary .text section.
[Bug c++/89640] New: g++ chokes on lambda with __attribute__
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89640 Bug ID: 89640 Summary: g++ chokes on lambda with __attribute__ Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- Regression in gcc 9 vs 8. https://godbolt.org/z/MGL0H4 void test() { []() __attribute__((noinline,cold)) { asm volatile(""); }(); } : In function 'void test()': :2:41: error: expected identifier before '{' token 2 | []() __attribute__((noinline,cold)) { | ^ :2:41: error: type-specifier invalid in lambda Compiler returned: 1
[Bug c++/88865] New: [[no_unique_address]] leads to sizeof(T) == 0, which cannot be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88865 Bug ID: 88865 Summary: [[no_unique_address]] leads to sizeof(T) == 0, which cannot be Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/rJT5X9
struct B {};
struct A {
    [[no_unique_address]] B a;
    [[no_unique_address]] B b;
    [[no_unique_address]] B c;
    [[no_unique_address]] B d;
};
int f() { return sizeof(A); }
f():
    push rbp
    mov rbp, rsp
    mov eax, 0
    pop rbp
    ret
In addition to the major issue that sizeof(A) must not be 0, it also must not be 1 either. It must be (at least) 4. http://eel.is/c++draft/intro.object#9.sentence-2 is very clear that [[no_unique_address]] (which clauses 7 and 8 define to mean a "subobject of zero size") only allows members of *different types* to overlap. a, b, c, and d are all distinct objects of the same type B, and therefore must have distinct addresses. > Two objects with overlapping lifetimes that are not bit-fields may have the same address if one is nested within the other, or if at least one is a subobject of zero size and they are of different types; otherwise, they have distinct addresses and occupy disjoint bytes of storage. https://godbolt.org/z/160XGN shows that some parts of gcc seem to understand this rule, so something very strange must be going on.
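The distinct-address requirement quoted above can be checked at runtime. A minimal sketch using the report's own struct (the helper function is mine, and the asserts assume a conforming compiler, i.e. one without this bug):

```cpp
// Distinct empty members of the same type must have distinct addresses,
// so sizeof(A) must be at least 4 even with [[no_unique_address]].
struct B {};
struct A {
    [[no_unique_address]] B a;
    [[no_unique_address]] B b;
    [[no_unique_address]] B c;
    [[no_unique_address]] B d;
};

bool members_have_distinct_addresses() {
    A x;
    const void* p[] = {&x.a, &x.b, &x.c, &x.d};
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j)
            if (p[i] == p[j]) return false;
    return true;
}
```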
[Bug c++/87312] New: statics in lambdas should be weak not local symbols
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87312 Bug ID: 87312 Summary: statics in lambdas should be weak not local symbols Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/z/oSuiQO // IN HEADER: inline auto lambda = [] () -> int* { static int foo; return &foo; }; inline int* func() { static int foo; return &foo; }; // NOT IN HEADER: int* lambda_addr() { return lambda(); } int* func_addr() { return func(); } Both of the "foo" objects should have exactly one address in the whole program. The "foo" in func() will work correctly, but the "foo" in the lambda will incorrectly have one address in each TU where it is used.
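Within a single TU the statics behave identically; the bug only manifests across multiple TUs, which a one-file sketch cannot show. Still, the single-TU invariant both functions must satisfy can be stated directly (the helper function is mine):

```cpp
// Both functions must return a stable address for their local static.
// Cross-TU uniqueness (the actual bug) needs two object files to observe:
// with the reported behaviour, the lambda's foo gets a local symbol and
// therefore a different address in each TU.
inline auto lambda = []() -> int* { static int foo; return &foo; };
inline int* func() { static int foo; return &foo; }

bool addresses_are_stable() {
    return lambda() == lambda() && func() == func();
}
```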
[Bug c++/67012] decltype(auto) with trailing return type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67012 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #1 from Mathias Stearn --- Still repros with latest trunk builds, even with -pedantic: https://godbolt.org/g/vMDcwg
[Bug c++/86276] New: Poor codegen when returning a std::vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86276 Bug ID: 86276 Summary: Poor codegen when returning a std::vector Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/g/aCiuAy While I'm a bit sad that good() isn't just "ret", I think that the current rules for allocation elision require something like that codegen. I'd expect bad() to codegen to roughly the same as good(), but then rather than calling operator delete, it should store [rax, rax + 1, rax + 22] to the three qwords starting at the out pointer. Instead, it does things like checking if the vector pointer it just zeroed is still zero to see if it needs to be deleted. It also seems to register an exception landing pad (I'm guessing to cover the operator new call) to handle freeing the vector's memory, even though it should know it isn't holding any. If I had to guess, I'd say the problem is that it thinks the hidden return out pointer has escaped when it hasn't really until a successful return. I'm pretty sure nothing in the language allows any way to access the return value before a function returns like that, but I'm now really curious if I'm wrong. If there is, is there any way to tell gcc that I promise I'm not doing anything quite that stupid? PS- is it helpful to include the code and asm here in addition to the godbolt links? 
#include <vector>
auto good() {
    std::vector<char> something;
    something.reserve(22);
    something = {0x02};
    //return something; Only difference from bad
}
auto bad() {
    std::vector<char> something;
    something.reserve(22);
    something = {0x02};
    return something;
}
good():
    sub rsp, 8
    mov edi, 22
    call operator new(unsigned long)
    mov BYTE PTR [rax], 2
    mov rdi, rax
    add rsp, 8
    jmp operator delete(void*)
bad():
    push rbp
    pxor xmm0, xmm0
    push rbx
    mov rbx, rdi
    sub rsp, 24
    mov QWORD PTR [rdi+16], 0
    movups XMMWORD PTR [rdi], xmm0
    mov edi, 22
    call operator new(unsigned long)
    mov rdi, QWORD PTR [rbx]
    test rdi, rdi
    je .L5
    mov QWORD PTR [rsp+8], rax
    call operator delete(void*)
    mov rax, QWORD PTR [rsp+8]
.L5:
    mov BYTE PTR [rax], 2
    lea rdx, [rax+22]
    mov QWORD PTR [rbx], rax
    add rax, 1
    mov QWORD PTR [rbx+8], rax
    mov rax, rbx
    mov QWORD PTR [rbx+16], rdx
    add rsp, 24
    pop rbx
    pop rbp
    ret
    mov rbp, rax
    jmp .L6
bad() [clone .cold.25]:
.L6:
    mov rdi, QWORD PTR [rbx]
    test rdi, rdi
    je .L7
    call operator delete(void*)
.L7:
    mov rdi, rbp
    call _Unwind_Resume
[Bug middle-end/86140] New: constprop clones with identical bodies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86140 Bug ID: 86140 Summary: constprop clones with identical bodies Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- This was tweaked example code to demonstrate something totally unrelated, but I noticed it was doing the constprop clone thing for functions with identical instructions, which seemed odd. While this exact case doesn't matter, it seemed like it might be a sign of a deeper bug. https://godbolt.org/g/Zwpi7J [[gnu::noinline]] void f(const int*){ asm volatile(""); } void good() { constexpr static int arr[1000] = {}; f(arr); } void bad() { constexpr int arr[1000] = {}; f(arr); } f(int const*) [clone .constprop.0]: ret f(int const*) [clone .constprop.1]: ret f(int const*): ret good(): jmp f(int const*) [clone .constprop.0] bad(): jmp f(int const*) [clone .constprop.1]
[Bug c++/86072] New: Poor codegen with atomics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86072 Bug ID: 86072 Summary: Poor codegen with atomics Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- https://godbolt.org/g/aEFhq8
#include <atomic>
std::atomic<int> progress{-1};
void combine_writes1() {
    // These should be reduced to a single store(0, release),
    // at least as long as release-sequence includes same-thread
    // relaxed stores. If that is removed, this should just be
    // a single relaxed store.
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
}
void combine_writes2() {
    // Ditto above, but should store 5.
    progress.store(0, std::memory_order_relaxed);
    progress.store(1, std::memory_order_relaxed);
    progress.store(2, std::memory_order_release);
    progress.store(3, std::memory_order_release);
    progress.store(4, std::memory_order_relaxed);
    progress.store(5, std::memory_order_relaxed);
}
void combine_loads() {
    // These should be reduced to either a single acquire-load
    // or an acquire fence.
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
}
combine_writes1():
    xor eax, eax
    mov DWORD PTR progress[rip], eax
    mov DWORD PTR progress[rip], eax
    mov DWORD PTR progress[rip], eax
    mov DWORD PTR progress[rip], eax
    mov DWORD PTR progress[rip], eax
    mov DWORD PTR progress[rip], eax
    ret
combine_writes2():
    mov DWORD PTR progress[rip], 0
    mov DWORD PTR progress[rip], 1
    mov DWORD PTR progress[rip], 2
    mov DWORD PTR progress[rip], 3
    mov DWORD PTR progress[rip], 4
    mov DWORD PTR progress[rip], 5
    ret
combine_loads():
    mov eax, DWORD PTR progress[rip]
    mov eax, DWORD PTR progress[rip]
    mov eax, DWORD PTR progress[rip]
    mov eax, DWORD PTR progress[rip]
    mov eax, DWORD PTR progress[rip]
    mov eax, DWORD PTR progress[rip]
    ret
progress:
    .long -1
This example seems to make gcc trunk ICE with a segfault: https://godbolt.org/g/M4ZQGS Possibly related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056
[Bug middle-end/86056] Codegen can result in multiple sequential mfence instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056 --- Comment #2 from Mathias Stearn --- Oh, I agree that this optimization must be permitted. I was using this example to prove this to someone else who didn't believe that. I only mentioned that to explain how that weird source code came to be. My point of this bug was that the code gen can be further optimized because there is never a good reason to have multiple mfence instructions back to back like that.
[Bug target/86056] New: Codegen can result in multiple sequential mfence instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056 Bug ID: 86056 Summary: Codegen can result in multiple sequential mfence instructions Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- This is from an example to prove that atomic_thread_fence does not prevent the compiler from optimizing non-escaped memory. https://godbolt.org/g/288mTC
#include <atomic>
#include <utility>
struct Type {
    Type(Type&&) = default;
    int i;
};
Type func(Type t) {
    auto out = Type(Type(std::move(t)));
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return out;
}
auto func2(Type t) {
    return func(func(func(func(std::move(t)))));
}
func(Type):
    movl (%rsi), %edx
    movq %rdi, %rax
    movl %edx, (%rdi)
    mfence
    ret
func2(Type):
    movl (%rsi), %edx
    movq %rdi, %rax
    mfence
    mfence
    mfence
    movl %edx, (%rdi)
    mfence
    ret
[Bug libstdc++/85813] make_exception_ptr should support __cxa_init_primary_exception path even with -fno-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85813 --- Comment #7 from Mathias Stearn --- > Simply returning an empty exception_ptr is what happened before the PR 64241 > change, so what we do now retains that behaviour. That might be the main > reason for it. Silently dropping errors always skeeves me out. I'm not sure if anyone is well served by the current behavior. If it were up to me (and I know it isn't) I'd make that case call std::terminate() or similar rather than returning the "no error" value. That seems consistent with the behavior of the __throw* functions, but it is a breaking change. Or even better, since gcc seems fine throwing through functions marked noexcept in -fno-exceptions TUs, maybe in the (very rare) case where copying/moving an exception throws inside an -fno-exceptions TU, just let it bubble out. > ::new (__e) _Ex(std::forward<_Ex>(__ex)); Should that be std::move(__ex)? I don't know why, but make_exception_ptr takes the exception by value rather than forwarding ref. I guess forward<_Ex> is fine either way, and will stay correct if that defect gets fixed. It just seemed odd to forward a value so I thought I'd mention it. > Mix-and-match works if the function gets inlined. Do you think this case warrants a [[gnu::always_inline]]? Given the sensitive nature of error handling, it seems pretty bad if a correct usage of make_exception_ptr() could be silently transformed to return the "no error" value just by linking in a bad library. Even if that library never dynamically executes make_exception_ptr(). I know I'd hate to be called in to debug that issue... I'm going to go make sure no code we link with uses -fno-exceptions!
[Bug libstdc++/85813] make_exception_ptr should support __cxa_init_primary_exception path even with -fno-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85813 --- Comment #3 from Mathias Stearn --- My assumption was that if E(...) throws and it can't be caught, it should be treated as any other case when an -fno-exceptions TU calls a throwing function. In this case that would mean calling terminate() due to the noexcept, which seems better than returning a null exception_ptr. However, while testing the behavior of mixing -fno-exceptions TUs with normal ones, I realized there may be a bigger problem due to ODR violations. In particular, if you link these TUs without optimizations, one of the asserts will trip depending on the link order:
// g++ -fno-exceptions -c
#include <cassert>
#include <exception>
void no_exceptions() {
    assert(!std::make_exception_ptr(1));
}
// g++ -fexceptions
#include <cassert>
#include <exception>
void no_exceptions();
int main() {
    assert(std::make_exception_ptr(1));
    no_exceptions();
}
Is that just accepted, implying that the whole program must be compiled with either -fexceptions or -fno-exceptions, rather than allowing mix-and-match? If so, I guess this whole point is moot.
[Bug libstdc++/85813] New: make_exception_ptr should support __cxa_init_primary_exception path even with -fno-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85813 Bug ID: 85813 Summary: make_exception_ptr should support __cxa_init_primary_exception path even with -fno-exceptions Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- It looks like nothing in that path actually needs exception support. This would allow make_exception_ptr to produce a valid exception_ptr even in TUs that don't support exceptions. While that may not sound valuable, the exception_ptr could be handed to TUs that do support exceptions where it can be rethrown (a misnomer in this case...) and caught. Also, if my paper http://wg21.link/p1066 is accepted, that would provide full support for handling the exception without ever throwing it. And if this ticket is impossible to implement for some reason, I'd love to know that before Rapperswil ;) The nullptr return was added in response to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64241, but it looks like that was done before the __cxa_init_primary_exception path existed. When that was added in https://github.com/gcc-mirror/gcc/commit/cce8e59224e18858749a2324bce583bcfd160d6c#diff-e1b5856b8cc940de58c12ef252d63c34R188, I can't tell if it was put in the #if __cpp_exceptions block for a good reason or just by default.
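For context, the behaviour the report wants to preserve in -fno-exceptions TUs is the ordinary -fexceptions round trip: make_exception_ptr stores an exception without ever throwing it, and rethrow_exception later throws it where it can be caught. A minimal sketch (the helper function and the sentinel value 42 are mine):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Store an exception in an exception_ptr, then throw and catch it.
int stored_value() {
    std::exception_ptr p = std::make_exception_ptr(std::runtime_error("boom"));
    assert(p != nullptr);  // a real (non-empty) exception_ptr
    try {
        std::rethrow_exception(p);
    } catch (const std::runtime_error&) {
        return 42;  // the stored exception arrived intact
    }
    return 0;
}
```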
[Bug libstdc++/85812] New: make_exception_ptr can leak the allocated exception if construction throws
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85812

            Bug ID: 85812
           Summary: make_exception_ptr can leak the allocated exception if
                    construction throws
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I think if this line throws, the exception allocated on line 185 will leak:
https://github.com/gcc-mirror/gcc/blob/3474beffb1fd23da44a752763e648d5d1ffa4d0f/libstdc%2B%2B-v3/libsupc%2B%2B/exception_ptr.h#L189.
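A sketch of the shape of the fix, using plain ::operator new in place of
libstdc++'s internal exception allocation (the name `make_on_heap` and the
structure are illustrative, not the actual patch): the buffer must be released
if constructing the object into it throws.

```cpp
#include <new>
#include <utility>

// Illustrative only: guard the placement-construction so the allocation is
// freed when the copy/move constructor throws, instead of leaking.
template <typename Ex>
Ex* make_on_heap(Ex ex) {
    void* buf = ::operator new(sizeof(Ex));
    try {
        return ::new (buf) Ex(std::move(ex));
    } catch (...) {
        ::operator delete(buf);
        throw;  // propagate the constructor's exception, without the leak
    }
}
```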
[Bug c++/85799] __builtin_expect doesn't propagate through inlined functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85799 --- Comment #1 from Mathias Stearn --- LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=37476
[Bug c++/85799] New: __builtin_expect doesn't propagate through inlined functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85799

            Bug ID: 85799
           Summary: __builtin_expect doesn't propagate through inlined
                    functions
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

It seems like __builtin_expect doesn't propagate to conditions when inlined.
This is unfortunate because in some cases it would be nice to be able to put
the expectation into methods so that every user doesn't need to do their own
hinting. Currently we need to use macros to achieve this.

See https://godbolt.org/g/MuPM77 for the full example.

inline bool expect_false(bool b) { return __builtin_expect(b, false); }

void inline_func_hint(bool b) {
    if (expect_false(b)) {
        unlikely(); // treated as more likely!!!
    } else {
        likely();
    }
}

inline_func_hint(bool):
        test    dil, dil
        je      .L11
        jmp     unlikely()
.L11:
        jmp     likely()
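The macro workaround mentioned above, sketched minimally (the macro and
function names are illustrative, not from the report): applying the hint at
the branch itself keeps it visible to the optimizer, since it does not
survive inlining of a helper function.

```cpp
// Conventional workaround: wrap the condition at the use site.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int classify(int v) {
    if (UNLIKELY(v < 0)) {
        return -1;  // cold path, hinted as unlikely
    }
    return 1;       // hot path
}
```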
[Bug c++/83400] g++ -O1 doesn't execute any code in destructors with a throw statement if it sees another throw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83400 --- Comment #3 from Mathias Stearn --- This is related to https://wg21.link/CWG2219
[Bug c++/85736] New: Support warn_unused or warn_unused_result on specific constructors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85736

            Bug ID: 85736
           Summary: Support warn_unused or warn_unused_result on specific
                    constructors
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

It would be nice to get the benefits of those attributes on a per-constructor
basis, rather than requiring them on the whole type. The particular use case I
have in mind is for unique_lock's default constructor (or at least on our
wrapper around it). I recently did a code review where someone typed:

std::unique_lock myMutex;

where they meant to use:

std::unique_lock lk(myMutex);

There is currently no warning for this at -Wall -Wextra, although thankfully
it is at least caught when myMutex has parentheses around it, which is the
more common mistake.

Clearly, it wouldn't make sense to put warn_unused on the whole unique_lock
since the second line is fine. It would probably make sense on almost all
default constructors actually, since with the exception of a few specific
types that alter global or thread-local state, why are you declaring a
default constructed variable then not using it at all? But on a few types like
unique_lock it seems actively dangerous rather than just simply wasteful.
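As a point of comparison (not something GCC supported when this was filed), a
hedged sketch of what a per-constructor marker could look like; C++20 later
allowed [[nodiscard]] on individual constructors, which covers exactly this
default-constructor case. The `Guard` type below is illustrative.

```cpp
struct Guard {
    // Discarding a default-constructed temporary is almost certainly a bug,
    // so only this constructor carries the attribute; the locking
    // constructor stays warning-free.
    [[nodiscard]] Guard() = default;

    explicit Guard(int& m) : held(&m) {}

    int* held = nullptr;  // stand-in for the mutex pointer a real lock holds
};
```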
[Bug middle-end/85721] bad codegen for looped copy of primitives at -O2 and -O3 (differently bad)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721

--- Comment #4 from Mathias Stearn ---

Marc Glisse pointed out at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720#c3 that I missed an
aliasing case when I created this ticket. memmove isn't a valid replacement if
out is in the range (in, in + n).

I did some benchmarking to see what the best solution is and how much this
matters. This seems to do the best on sandybridge, haswell, and an Opteron
6344 Piledriver:

[[gnu::noinline, gnu::optimize("s")]]
void copy0(char* out, const char* in, size_t n) {
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        memcpy(out, in, n);
        return;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

copy0(char*, char const*, unsigned long):
        cmp     rdx, 7
        jbe     .L7
        lea     rax, [rsi+rdx]
        cmp     rdi, rax
        jnb     .L3
        lea     rax, [rdi+rdx]
        cmp     rsi, rax
        jb      .L7
.L3:
        jmp     memcpy
.L7:
        xor     eax, eax
.L5:
        cmp     rax, rdx
        je      .L1
        mov     cl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        inc     rax
        jmp     .L5
.L1:
        ret

With char, it is substantially faster than the current codegen for the
original loop at -O2 and moderately faster than -O3, while being about 10% of
the size. With a TriviallyCopyable type with a non-trivial default ctor, even
-O3 does byte-by-byte, so it is a substantial win there as well. Let me know
if you'd like me to post the benchmark I was using.
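A quick self-check (mine, not part of the report) that copy0's slow path
really preserves the forward byte-by-byte semantics on the overlapping ranges
the memcpy guard excludes. The attributes are dropped here so the sketch
stays portable.

```cpp
#include <cstring>
#include <cstddef>

// Same logic as copy0 above: memcpy only when the ranges are provably
// disjoint, otherwise fall back to a plain forward byte copy.
void copy0(char* out, const char* in, std::size_t n) {
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        std::memcpy(out, in, n);   // disjoint: memcpy is safe
        return;
    }
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];            // overlapping or tiny: forward copy
    }
}
```

An overlapping forward copy with out == in + 1 smears the first byte across
the range, which is exactly what the loop (and not memmove) produces.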
[Bug middle-end/85720] bad codegen for looped assignment of primitives at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720

--- Comment #4 from Mathias Stearn ---

(In reply to Marc Glisse from comment #3)
> Again, you are ignoring aliasing issues (just like in your other PR the
> function copy isn't equivalent to memmove). Does adding __restrict change
> the result? Also, B[i]=B[i]+1 doesn't look like a memset...

Sorry, I typoed. It was supposed to be B[i] = A[i] + 1. That still does
basically the same thing though: https://godbolt.org/g/dtmU5t.

Good point about aliasing though. I guess the right codegen in that case would
actually be something that detected the overlap and did the right calls to
memset to only set each byte once. Or just do the simple thing:

if (b > a && b < a + n) {
    memset(b, 1, n);
    memset(a, 0, n);
} else {
    memset(a, 0, n);
    memset(b, 1, n);
}

Yes, __restrict helps, but that isn't part of standard C++, and it seems like
it never will be.
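A small check (mine, not from the thread) that the branchy two-memset rewrite
matches the loop's semantics even when b points into a's range, assuming the
loop in question is `for (i) { a[i] = 0; b[i] = a[i] + 1; }`:

```cpp
#include <cstring>
#include <cstddef>

// Reference semantics: what the source loop computes, overlap and all.
void loop_version(unsigned char* a, unsigned char* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        a[i] = 0;
        b[i] = a[i] + 1;  // always stores 1, even when b aliases a
    }
}

// The proposed rewrite: order the memsets so each byte keeps its final value.
void memset_version(unsigned char* a, unsigned char* b, std::size_t n) {
    if (b > a && b < a + n) {
        std::memset(b, 1, n);  // b overlaps a's tail: write the 1s first,
        std::memset(a, 0, n);  // then let the 0s win inside a's range
    } else {
        std::memset(a, 0, n);
        std::memset(b, 1, n);
    }
}
```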
[Bug middle-end/85721] bad codegen for looped copy of primitives at -O2 and -O3 (differently bad)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721

--- Comment #3 from Mathias Stearn ---

@Jonathan Wakely, that is because std::copy cheats and calls memmove directly.
A slight modification of the type that shouldn't matter defeats that
optimization and causes both forms to degrade to byte-by-byte:
https://godbolt.org/g/Z4fWNT.

I actually consider the fact that the explicit optimization in std::copy is
necessary to be an optimizer bug in itself. Why isn't the optimizer noticing
that the simple implementation is the same as a memmove and doing the
transformation for you? It generally seems unfortunate if similar code that
clearly means the same thing results in very different performance. Ideally,
even providing user-defined inline copy operations should result in calling
memmove if they are equivalent, since the compiler should be able to "see
through" the non-triviality and optimize it all away.

I'm filing these tickets based on what Martin Sebor said: "As an aside, using
assignment and copy constructors should be at least as efficient as calling
memset and memcpy, and ideally more, and they should always be preferred over
the raw memory functions. If/where they aren't as efficient please open bugs
for missing optimizations." Do you disagree with that statement?
[Bug middle-end/85720] bad codegen for looped assignment of primitives at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720

--- Comment #2 from Mathias Stearn ---

Hmm. Taking the example from the -ftree-loop-distribute-patterns
documentation, it still seems to generate poor code, this time at both -O2 and
-O3: https://godbolt.org/g/EsQDj8

Why isn't that transformed to

memset(A, 0, N);
memset(B, 1, N);

? This feels similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721.
Should I make a new ticket with this example?
[Bug c++/85721] New: bad codegen for looped copy of primitives at -O2 and -O3 (differently bad)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721

            Bug ID: 85721
           Summary: bad codegen for looped copy of primitives at -O2 and
                    -O3 (differently bad)
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/g/Gg9fFt

Related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720, but filed
separately because this also affects -O3. Similarly, while this affects types
other than char, char is most egregious.

using SIZE_T = decltype(sizeof(0));

void copy(char* out, const char* in, SIZE_T n) {
    for (SIZE_T i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

This should probably just be compiled to check size then jmp memmove. At O2 it
copies byte-by-byte:

copy(char*, char const*, unsigned long):
        test    rdx, rdx
        je      .L1
        xor     eax, eax
.L3:
        movzx   ecx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        add     rax, 1
        cmp     rdx, rax
        jne     .L3
.L1:
        ret

At O3 it generates a TON of code:

copy(char*, char const*, unsigned long):
        test    rdx, rdx
        je      .L1
        lea     rax, [rsi+16]
        cmp     rdi, rax
        lea     rax, [rdi+16]
        setnb   cl
        cmp     rsi, rax
        setnb   al
        or      cl, al
        je      .L7
        lea     rax, [rdx-1]
        cmp     rax, 14
        jbe     .L7
        mov     rcx, rdx
        xor     eax, eax
        and     rcx, -16
.L4:
        movdqu  xmm0, XMMWORD PTR [rsi+rax]
        movups  XMMWORD PTR [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, rcx
        jne     .L4
        mov     rax, rdx
        and     rax, -16
        cmp     rdx, rax
        je      .L1
        movzx   ecx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        lea     rcx, [rax+1]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+1+rax]
        mov     BYTE PTR [rdi+1+rax], cl
        lea     rcx, [rax+2]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+2+rax]
        mov     BYTE PTR [rdi+2+rax], cl
        lea     rcx, [rax+3]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+3+rax]
        mov     BYTE PTR [rdi+3+rax], cl
        lea     rcx, [rax+4]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+4+rax]
        mov     BYTE PTR [rdi+4+rax], cl
        lea     rcx, [rax+5]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+5+rax]
        mov     BYTE PTR [rdi+5+rax], cl
        lea     rcx, [rax+6]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+6+rax]
        mov     BYTE PTR [rdi+6+rax], cl
        lea     rcx, [rax+7]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+7+rax]
        mov     BYTE PTR [rdi+7+rax], cl
        lea     rcx, [rax+8]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+8+rax]
        mov     BYTE PTR [rdi+8+rax], cl
        lea     rcx, [rax+9]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+9+rax]
        mov     BYTE PTR [rdi+9+rax], cl
        lea     rcx, [rax+10]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+10+rax]
        mov     BYTE PTR [rdi+10+rax], cl
        lea     rcx, [rax+11]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+11+rax]
        mov     BYTE PTR [rdi+11+rax], cl
        lea     rcx, [rax+12]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+12+rax]
        mov     BYTE PTR [rdi+12+rax], cl
        lea     rcx, [rax+13]
        cmp     rdx, rcx
        jbe     .L1
        movzx   ecx, BYTE PTR [rsi+13+rax]
        mov     BYTE PTR [rdi+13+rax], cl
        lea     rcx, [rax+14]
        cmp     rdx, rcx
        jbe     .L1
        movzx   edx, BYTE PTR [rsi+14+rax]
        mov     BYTE PTR [rdi+14+rax], dl
        ret
.L7:
        xor     eax, eax
.L3:
        movzx   ecx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        add     rax, 1
        cmp     rdx, rax
        jne     .L3
.L1:
        ret

A) This should probably just call memmove, which has a tuned implementation
for many architectures and uses ifunc dispatch to choose the right one based
on the runtime CPU rather than the compile-time settings. Also, since all
functions like this for all types would all just jump to a single function,
there should be I$ advantages.

B) If you really want to emit code for this rather than calling into libc, it
is probably best to use their technique of overlapped reads and writes for the
last vector rather than going into an unrolled byte-by-byte loop:
https://github.molgen.mpg.de/git-mirror/glibc/blob/20003c49884422da7ffbc459cdeee768a6fee07b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S#L331-L335
[Bug c++/85720] New: bad codegen for looped assignment of primitives at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720

            Bug ID: 85720
           Summary: bad codegen for looped assignment of primitives at -O2
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/g/qp19Cv

using SIZE_T = decltype(sizeof(0));

void fill(char* p, SIZE_T n) {
    for (SIZE_T i = 0; i < n; i++) {
        p[i] = -1;
    }
}

fill(char*, unsigned long):
        test    rsi, rsi
        je      .L1
        add     rsi, rdi
.L3:
        mov     BYTE PTR [rdi], -1
        add     rdi, 1
        cmp     rdi, rsi
        jne     .L3
.L1:
        ret

At -O3 it basically just tail-calls memset. Also applies to other types than
char, but char is most egregious.

This ticket is spun out from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85707
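The claim that this loop is a memset in disguise can be checked directly; a
trivial harness (mine, not from the report) restating the function runnably:

```cpp
#include <cstring>
#include <cstddef>

using SIZE_T = decltype(sizeof(0));

// The loop from the report: byte-wise store of -1, i.e. 0xff per byte,
// which is exactly what memset(p, 0xff, n) produces.
void fill(char* p, SIZE_T n) {
    for (SIZE_T i = 0; i < n; i++) {
        p[i] = -1;
    }
}
```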
[Bug c++/85707] -Wclass-memaccess should exempt safe usage inside of a class and its friends
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85707

--- Comment #8 from Mathias Stearn ---

> If the constructor had other side-effects (e.g., count
> the number of objects of the class) bypassing it could
> violate the invariant.

I thought one of the points of friendship was to allow friends to violate a
class's invariants temporarily as long as they promise that the class is left
in a valid state in the end. Since presumably a class and its friends are
maintained by the same people, they are aware of what the actual requirements
of the class are, even if they can't be stated precisely in the language
today.
[Bug c++/85707] -Wclass-memaccess should exempt safe usage inside of a class and its friends
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85707

--- Comment #2 from Mathias Stearn ---

Here is a boiled down example of some of our code that trips this warning:
https://godbolt.org/g/ChLrch. It also shows why we do this, since the codegen
is *substantially* better for init_table_memset than init_table_assignment, at
least at -O2. Even if you improve the codegen for that case tomorrow, we'd
still need to keep using memset for a while until we stop supporting older
compilers.

This is reduced from
https://github.com/mongodb/mongo/blob/11a3d5ccb1216da0e84d941fd48e486f72455ba4/src/mongo/db/pipeline/document_internal.h.
The actual key type is a variable-length string stored directly in the buffer,
and the Key type in the interface is our pre-17 string_view equivalent. The
value is actually a type called Value, which holds an internal 16-byte type
called ValueStorage as its only member. ValueStorage also triggers the warning
in its lifetime methods:
https://github.com/mongodb/mongo/blob/11a3d5ccb1216da0e84d941fd48e486f72455ba4/src/mongo/db/pipeline/value_internal.h#L165-L221
(the DEV macro expands to "if (DEBUG_BUILD)" so you can ignore anything on
those lines). If necessary I can try to boil down that type too, but it is
much more complex, so I'm not sure how small I can make it.

This is all to implement what is essentially a dynamically-typed JSON object,
which is something we need to be *REALLY* fast. A lot of effort went into
micro-optimizing these types so that the business logic code doesn't need to
worry about any of this and can write very natural looking, modern C++ code.
All of this memory-munging is hidden in internal types in _internal.h files.
The user-facing types are not supposed to expose any of this, except to the
implementations, which are all friendly with each other.

The third_party code that is tripping this is in S2. It tries to memcpy
https://github.com/mongodb/mongo/blob/11a3d5ccb1216da0e84d941fd48e486f72455ba4/src/mongo/db/pipeline/value_internal.h#L165-L221
an array of S2Points (typedef for Vector3)
https://github.com/mongodb/mongo/blob/11a3d5ccb1216da0e84d941fd48e486f72455ba4/src/third_party/s2/util/math/vector3.h#L30.
This doesn't meet the self-or-friend condition described in the ticket, so I'm
not sure what the solution there is, but it is an example of code that is
correct (I think, I haven't reviewed it *too* closely) but still triggers this
warning.
[Bug c++/85707] New: -Wclass-memaccess should exempt safe usage inside of a class and its friends
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85707

            Bug ID: 85707
           Summary: -Wclass-memaccess should exempt safe usage inside of a
                    class and its friends
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I get the point of the warning and would like to be able to turn it on.
However we have some types that have been specifically designed to be memset
and memcpy/memmove-able as long as certain rules are followed. This is an
implementation detail of the type, so non-friend users are expected to use
the default/copy/move constructors, rather than directly manipulating the
bytes. However classes (and their friends) are always allowed to violate
their own invariants, even if external users aren't. That is why I think the
warning should be suppressed* inside of contexts that have access to
internals.

*Ideally it would still trip if the class's members were not mem-accessible by
it, but it seems more important (to me) to avoid the false positives than to
avoid these kinds of false negatives if only one is possible.

I already know about the void*-cast to suppress the warning. I tried doing
that in our code base and it was required in too many places, all of which
were correct (as in no actual misuse was found). Additionally, this trips in
third-party code that we'd rather not alter unless there is an actual bug.

As a potential alternative, adding an attribute like [[gnu::memaccessable]]
that can be put on a type to suppress the warning for uses of that type might
also work.
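The void*-cast suppression referred to above looks roughly like this (the
`Table` type is illustrative, not from the report): laundering the destination
pointer through void* hides the class type from the warning at that call
site.

```cpp
#include <cstring>

// Illustrative type deliberately designed to be memset-able.
struct Table {
    int slots[4];

    void reset() {
        // Casting to void* suppresses -Wclass-memaccess here; without the
        // cast, memset on a class object can trigger the warning.
        std::memset(static_cast<void*>(this), 0, sizeof *this);
    }
};
```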
[Bug c++/85680] Missed optimization for value-init of variable-sized allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85680 --- Comment #3 from Mathias Stearn --- MSVC and ICC both also handle this poorly: https://godbolt.org/g/i4wMYa https://developercommunity.visualstudio.com/content/problem/246786/poor-codegen-for-value-init-followed-by-explicit-i.html
[Bug c++/85680] Missed optimization for value-init of variable-sized allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85680

--- Comment #2 from Mathias Stearn ---

FYI, I don't think this is a signed/unsigned thing, since it also repros with
unsigned long: https://godbolt.org/g/LTmrpi

My initial repro actually used size_t, but I (incorrectly) changed it to long
rather than unsigned long, based on past experience that compiler authors
prefer repros that don't have any #includes and the (IMO) sad fact that size_t
still isn't actually part of the language.
[Bug c++/85680] New: Missed optimization for value-init of variable-sized allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85680

            Bug ID: 85680
           Summary: Missed optimization for value-init of variable-sized
                    allocation
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/g/6yZEKf (compiled with -O3)

char* variableSize(long n) {
    auto p = new char[n]();
    for (int i = 0; i < n; i++) {
        p[i] = 0xff;
    }
    return p;
}

char* small() { return variableSize(8); }
char* large() { return variableSize(10'000); }

The variableSize() case generates two calls to memset (both being conditional
on n > 0, if I'm reading this right), but the two checks are done in different
ways. That may be a second optimizer bug: a missed jump-threading opportunity.

variableSize(long):
        push    rbx
        mov     rbx, rdi
        call    operator new[](unsigned long)
        mov     rcx, rax
        mov     rax, rbx
        sub     rax, 1
        js      .L5
        lea     rax, [rbx-2]
        mov     edx, 1
        mov     rdi, rcx
        cmp     rax, -1
        cmovge  rdx, rbx
        xor     esi, esi
        call    memset
        mov     rcx, rax
.L5:
        test    rbx, rbx
        jle     .L1
        mov     rdi, rcx
        mov     rdx, rbx
        mov     esi, 255
        call    memset
        mov     rcx, rax
.L1:
        mov     rax, rcx
        pop     rbx
        ret

Note that when g++ can see the allocation size it *does* elide the initial
memset:

small():                                # @small()
        push    rax
        mov     edi, 8
        call    operator new[](unsigned long)
        mov     qword ptr [rax], -1
        pop     rcx
        ret
large():                                # @large()
        push    rbx
        mov     edi, 1
        call    operator new[](unsigned long)
        mov     rbx, rax
        mov     esi, 255
        mov     edx, 1
        mov     rdi, rax
        call    memset
        mov     rax, rbx
        pop     rbx
        ret

Related clang bug: https://bugs.llvm.org/show_bug.cgi?id=37351
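A runtime restatement (mine) of what the function computes, making it
concrete that the value-init's zeroing is fully dead: every byte is
overwritten with 0xff before the pointer escapes.

```cpp
char* variableSize(long n) {
    auto p = new char[n]();   // value-init: zero-fills the buffer...
    for (int i = 0; i < n; i++) {
        p[i] = 0xff;          // ...then every byte is overwritten anyway
    }
    return p;
}
```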
[Bug c++/83400] New: g++ -O1 doesn't execute any code in destructors with a throw statement if it sees another throw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83400

            Bug ID: 83400
           Summary: g++ -O1 doesn't execute any code in destructors with a
                    throw statement if it sees another throw
           Product: gcc
           Version: 7.2.1
           Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

Repro code below doesn't print the "in dtor" line when compiled with -O1 or
higher. It does with g++ -O0 and with clang++ at any optimization level. Note
that even with the raise() call enabled, which would prevent the second
exception from firing, it still won't print "in dtor". I've reproed with g++
4.8.2, 5.4.0, 7.2.0, and 7.2.1, all on linux. No warnings are output even with
-Wall -Wextra.

#include <iostream>
#include <csignal>
#include <exception>
#include <cstdlib>

struct ThrowInDtor {
    ~ThrowInDtor() noexcept(false) {
        std::cout << "in dtor" << std::endl; // Doesn't print.
        // raise(SIGINT); // Still repros with this commented in.
        throw std::exception();
    }
};

int main() {
    try {
        ThrowInDtor throws;
        throw std::exception();
    } catch (std::exception&) {
        // unreachable
    }
}
[Bug tree-optimization/77943] [5/6 Regression] Optimization incorrectly commons noexcept calls with non-noexcept calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77943

--- Comment #12 from Mathias Stearn ---

We were hoping to replace many places in our code that do the following with
noexcept attributes on functions:

> try {doStuffThatShouldntThrow();} catch(...) {std::terminate();}

We wanted to take advantage of noexcept preventing exceptions from propagating
in a way that causes the terminate handler to be invoked at the throw site
rather than the catch site. We've avoided doing this so far because of the
bugs that we've found, leaving us with low confidence that we can rely on it.
Can you think of any similar cases that are likely to cause issues? Where do
you think it makes sense to focus our testing?

The scariest issues we've seen were a combination of failures to enforce
noexcept (like this bug) combined with noexcept causing try/catch blocks to be
optimized out. That led to an exception thrown from a noexcept function
bypassing the surrounding catch block, escaping two layers of protection, and
reaching code that really should never see it. Are there any compiler flags we
can use to tell it not to eliminate the catches based on noexcept annotations?

We've also noticed that 'throw()' annotations have fewer issues, but have
avoided them because they are deprecated. As compiler writers, would you trust
'throw()' more or suggest we stick with noexcept?
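The hand-written confinement pattern quoted above, as a tiny self-contained
helper (the name `run_no_throw` is mine): this is the behavior a bare
noexcept annotation is supposed to guarantee for us.

```cpp
#include <exception>

// Manual exception barrier: any escaping exception terminates at the
// boundary, which is what noexcept is meant to enforce automatically.
template <typename F>
void run_no_throw(F f) {
    try {
        f();
    } catch (...) {
        std::terminate();
    }
}
```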
[Bug c++/77943] Optimization incorrectly commons noexcept calls with non-noexcept calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77943

--- Comment #3 from Mathias Stearn ---

> Not being a C++ expect, but following spec:
> http://en.cppreference.com/w/cpp/language/noexcept_spec
>
> "If a search for a matching exception handler leaves a function marked
> noexcept or noexcept(true), std::terminate is called immediately."
>
> Which is probably what happens in fatal() function

The problem is that it merges the calls to fatal() and notFatal() into a
single call with no branch, so there is no way to only call std::terminate()
if the call to throws() was through fatal().

(In reply to Martin Liška from comment #2)

PS - I realized I uploaded an old version of the repro that shows the
alternate incorrect behavior of always calling std::terminate() even when it
should go through notFatal(). Sorry about that. If you reverse the branches of
the if/else you'll get the behavior I describe in the initial report.
[Bug c++/77943] Optimization incorrectly commons noexcept calls with non-noexcept calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77943 --- Comment #1 from Mathias Stearn --- Created attachment 39787 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39787&action=edit Reproducer
[Bug c++/77943] New: Optimization incorrectly commons noexcept calls with non-noexcept calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77943

            Bug ID: 77943
           Summary: Optimization incorrectly commons noexcept calls with
                    non-noexcept calls
           Product: gcc
           Version: 6.2.1
            Status: UNCONFIRMED
          Severity: critical
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

When compiled with -O0 and -O1 the program behaves correctly and only calls
std::terminate() when passed an argument. With -O2 and -O3 it doesn't call
std::terminate(), which means that it allowed an exception to escape from a
noexcept function in violation of the standard.

> g++ -O0 noexcept_test.cpp && ./a.out ; ./a.out 1
should die: 0
should die: 1
terminate called after throwing an instance of 'int'
Aborted (core dumped)

> g++ -O1 noexcept_test.cpp && ./a.out ; ./a.out 1
should die: 0
should die: 1
terminate called after throwing an instance of 'int'
Aborted (core dumped)

> g++ -O2 noexcept_test.cpp && ./a.out ; ./a.out 1
should die: 0
should die: 1

> g++ -O3 noexcept_test.cpp && ./a.out ; ./a.out 1
should die: 0
should die: 1

The actual behavior with -O2 and -O3 depends on which way the if block is
written. The attached version is the scariest, but if you swap the if and else
blocks and remove the ! from the condition, the executable will always call
std::terminate() even when it shouldn't.
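The attachment itself isn't reproduced in this digest; a hedged reconstruction
of its shape, based only on the function names mentioned in comment #3
(fatal(), notFatal(), throws()): a noexcept and a plain wrapper around the
same throwing call, which the buggy optimization commons into one call.

```cpp
// Illustrative reconstruction, not the actual attachment: the two wrappers
// differ only in noexcept, so commoning them erases the terminate semantics.
void throws() { throw 1; }

void fatal() noexcept { throws(); }  // escape must reach std::terminate
void notFatal() { throws(); }        // exception propagates normally

int run(bool useFatal) {
    try {
        if (useFatal) {
            fatal();
        } else {
            notFatal();
        }
    } catch (int) {
        return 1;  // only the notFatal() path should ever land here
    }
    return 0;
}
```

Only the non-terminating path is exercised below; run(true) would correctly
abort the process on a fixed compiler.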