[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #5 from Alexander Monakov ---
(In reply to Richard Biener from comment #4)
> For the case at hand loading two vectors from the destination and then
> punpck{h,l}bw and storing them again might be the most efficient thing
> to do here.

I think such a read-modify-write on the destination introduces a data race for bytes that are not accessed in the original program, so that would be okay only under -fallow-store-data-races?
[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov ---
With '-fdisable-tree-forwprop4 -msse4.1' you see what the vectorizer perhaps wanted to achieve.
[Bug rtl-optimization/108318] Floating point calculation moved out of loop despite fesetround
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108318

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
Please see the documentation for the -frounding-math option; but even with that option added, your testcase still has the faux-invariant moved by RTL PRE (-fno-gcse). Interestingly, if your testcase is modified to compute the sum before the call:

#include <fenv.h>

void foo (double res[4], double a, double b, double x[])
{
  a = x[0];
  b = x[1];
  static const int rm[4] = { FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD };
  for (int i = 0; i < 4; ++i)
    {
      double t = a + b;
      fesetround (rm[i]);
      res[i] = t;
    }
  fesetround (FE_TONEAREST); // restore default
}

then it demonstrates how a few *other* optimizations also perform unwanted motion:

* SSA PRE (-fno-tree-pre)
* TER (-fno-tree-ter)
* RTL LIM (-fno-move-loop-invariants)
* and finally the register allocator (unavoidable)
[Bug target/108315] New: -mcpu=power10 changes ABI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108315

Bug ID: 108315
Summary: -mcpu=power10 changes ABI
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: ABI, wrong-code
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Target: powerpc64le-*-*

Created attachment 54202
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54202&action=edit
testcase

At least the documentation should mention this, if it is intentional. In the attached example, the function bar is compiled to

bar:
        .localentry bar,1
        mtctr 3
        mr 12,3
        bctr
        .long 0
        .byte 0,0,0,0,0,0,0,0

i.e. it does not preserve r2 (it's compiled with -mcpu=power10). If the caller is not compiled with -mcpu=power10, it needs r2 preserved (bar has a localentry, so the nop in the caller stays a nop after linking). I verified that the testcase misbehaves on the Compile Farm's gcc135: as it does not use any power10-specific instructions, it's runnable there.
[Bug middle-end/108256] New: Missing integer overflow instrumentation when assignment LHS is narrow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108256

Bug ID: 108256
Summary: Missing integer overflow instrumentation when assignment LHS is narrow
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

For

unsigned short f(unsigned short x, unsigned short y)
{
    return x * y;
}

unsigned short g(unsigned short x, unsigned short y)
{
    int r = x * y;
    return r;
}

gcc -O2 -fsanitize=undefined emits instrumentation only for 'g', although both functions are equivalent. When 'int r' is changed to 'unsigned short r', 'g' is also not instrumented.

PR 107912 shows a slightly more complicated variant of this. Affects both C and C++.
[Bug target/108229] [13 Regression] unprofitable STV transform since r13-4873-g0b2c1369d035e928
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229

--- Comment #3 from Alexander Monakov ---
Thank you! I considered this unprofitable for these reasons:

1. As you said, the code grows in size, but the speed benefit is not clear.

2. The transform converts load+add operations in a loop, and their final uses outside of the loop. How does the costing work in this case, i.e. how are changes for the more frequently executed instructions weighted against changes for the instructions that will be executed once?

3. The scalar 'add reg, mem' instruction results in one micro-fused uop that is handled as one uop during renaming (one of the narrowest points in the pipeline). It is then issued to two execution units (one for the load and one for the add).

4. On AMD, there are separate fp/simd pipes, so when the code is already simd-heavy, as in this example, STV offloads instructions from the integer pipes to the possibly already-busy simd/fp pipes.

That said, the transformed portion is small relative to the inner loop of the example, so benchmarking yesterday's trunk with/without -mno-stv on Zen 2, I get:

27.26 bytes/cycle, 3.07 instructions/cycle vs. 26.01 bytes/cycle, 2.97 instructions/cycle

So it's not the end of the world for this particular example, but I wanted to raise the issue in case there's a costing problem in STV that needs correcting.
[Bug target/108229] New: [13 Regression] unprofitable STV transform
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229

Bug ID: 108229
Summary: [13 Regression] unprofitable STV transform
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*

In the following example, STV is making a very unprofitable transformation on trunk, but not on gcc-12:

#include <stdint.h>
#include <stddef.h>

struct b {
  struct b *next;
  uint64_t data[511];
};

typedef uint64_t u64v2 __attribute__((vector_size(16)));

static inline void vsum(u64v2 s[], uint64_t *x, size_t n)
{
  typedef u64v2 u64v2_u __attribute__((may_alias));
  u64v2_u *vx = (void *)x;
  for (; n; vx += 4, n -= 8) {
    s[0] += vx[0];
    s[1] += vx[1];
    s[2] += vx[2];
    s[3] += vx[3];
  }
}

uint64_t sum(struct b *b)
{
  uint64_t s = 0;
  u64v2 vs[4] = { 0 };
  do {
    vsum(vs, b->data + 7, 511-7);
#pragma GCC unroll(7)
    for (int i = 0; i < 7; i++)
      s += b->data[i];
  } while ((b = b->next));
  vs[0] += vs[1] + vs[2] + vs[3];
  return s + vs[0][0] + vs[0][1];
}

gcc -O2 -mavx (-mavx is not necessary, plain -O2 also triggers it):

sum:
        vpxor   xmm2, xmm2, xmm2
        vmovdqa xmm1, xmm2
        vmovdqa xmm3, xmm2
        vmovdqa xmm0, xmm2
        vmovdqa xmm5, xmm2
.L3:
        lea     rax, [rdi+64]
        lea     rdx, [rdi+4096]
.L2:
        vpaddq  xmm0, xmm0, XMMWORD PTR [rax]
        vpaddq  xmm3, xmm3, XMMWORD PTR [rax+16]
        add     rax, 64
        vpaddq  xmm1, xmm1, XMMWORD PTR [rax-32]
        vpaddq  xmm2, xmm2, XMMWORD PTR [rax-16]
        cmp     rdx, rax
        jne     .L2
        vmovq   xmm6, QWORD PTR [rdi+16]
        vmovq   xmm4, QWORD PTR [rdi+8]
        vpaddq  xmm4, xmm4, xmm6
        vpaddq  xmm4, xmm4, xmm5
        vmovq   xmm5, QWORD PTR [rdi+24]
        vpaddq  xmm4, xmm4, xmm5
        vmovq   xmm5, QWORD PTR [rdi+32]
        vpaddq  xmm4, xmm4, xmm5
        vmovq   xmm5, QWORD PTR [rdi+40]
        vpaddq  xmm4, xmm4, xmm5
        vmovq   xmm5, QWORD PTR [rdi+48]
        vpaddq  xmm4, xmm4, xmm5
        vmovq   xmm5, QWORD PTR [rdi+56]
        mov     rdi, QWORD PTR [rdi]
        vpaddq  xmm5, xmm4, xmm5
        test    rdi, rdi
        jne     .L3
        vpaddq  xmm1, xmm1, xmm2
        vpaddq  xmm0, xmm0, xmm3
        vpaddq  xmm0, xmm0, xmm1
        vmovdqa xmm1, xmm0
        vpsrldq xmm0, xmm0, 8
        vpaddq  xmm0, xmm1, xmm0
        vpaddq  xmm0, xmm0, xmm5
        vmovq   rax, xmm0
        ret

compare with gcc -O2 -mavx -mno-stv:

sum:
        vpxor   xmm2, xmm2, xmm2
        xor     edx, edx
        vmovdqa xmm1, xmm2
        vmovdqa xmm3, xmm2
        vmovdqa xmm0, xmm2
.L3:
        lea     rax, [rdi+64]
        lea     rcx, [rdi+4096]
.L2:
        vpaddq  xmm0, xmm0, XMMWORD PTR [rax]
        vpaddq  xmm3, xmm3, XMMWORD PTR [rax+16]
        add     rax, 64
        vpaddq  xmm1, xmm1, XMMWORD PTR [rax-32]
        vpaddq  xmm2, xmm2, XMMWORD PTR [rax-16]
        cmp     rcx, rax
        jne     .L2
        mov     rax, QWORD PTR [rdi+16]
        add     rax, QWORD PTR [rdi+8]
        add     rdx, rax
        add     rdx, QWORD PTR [rdi+24]
        add     rdx, QWORD PTR [rdi+32]
        add     rdx, QWORD PTR [rdi+40]
        add     rdx, QWORD PTR [rdi+48]
        add     rdx, QWORD PTR [rdi+56]
        mov     rdi, QWORD PTR [rdi]
        test    rdi, rdi
        jne     .L3
        vpaddq  xmm0, xmm0, xmm3
        vpaddq  xmm1, xmm1, xmm2
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rcx, xmm0
        vpextrq rax, xmm0, 1
        add     rax, rcx
        add     rax, rdx
        ret
[Bug middle-end/108209] goof in genmatch.cc:commutative_op
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108209

--- Comment #1 from Alexander Monakov ---
Keeping notes as I go... The checks for 'op0' in lower_for are duplicated.
[Bug middle-end/108209] New: goof in genmatch.cc:commutative_op
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108209

Bug ID: 108209
Summary: goof in genmatch.cc:commutative_op
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

It pretends that a define_operator_list is commutative when its first member is NOT commutative:

  if (user_id *uid = dyn_cast <user_id *> (id))
    {
      int res = commutative_op (uid->substitutes[0]);
      if (res < 0)
        return 0;
      for (unsigned i = 1; i < uid->substitutes.length (); ++i)
        if (res != commutative_op (uid->substitutes[i]))
          return -1;
      return res;
    }

The first 'return 0' should be 'return -1' instead.
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117 --- Comment #16 from Alexander Monakov --- Draft patch for the sched1 issue: https://inbox.sourceware.org/gcc-patches/cf62c3ec-0a9e-275e-5efa-2689ff1f0...@ispras.ru/T/#m95238afa0f92daa0ba7f8651741089e7cfc03481
[Bug middle-end/108140] ICE expanding __rbit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140

Alexander Monakov changed:
           What    |Removed                    |Added
                 CC|                           |amonakov at gcc dot gnu.org
           Keywords|                           |ice-on-valid-code
          Component|c                          |middle-end
             Target|                           |aarch64-*-*
            Summary|tzcnt gives different      |ICE expanding __rbit
                   |result in debug vs release |

--- Comment #4 from Alexander Monakov ---
When comment #0 says "this crashes at -O2", it means an ICE in expand for the '__rbit' intrinsic on this testcase, which is reproducible on 12.2 and trunk:

#include <arm_acle.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    unsigned long long input = argc-1;
    unsigned long long v = __clz(__rbit(input));
    printf("%d %d\n", argc, v >= 64 ? 123 : 456);
}

I've edited the bug title to reflect this.
[Bug rtl-optimization/57067] Missing control flow edges for setjmp/longjmp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 --- Comment #9 from Alexander Monakov --- *** Bug 108117 has been marked as a duplicate of this bug. ***
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov changed:
           What    |Removed |Added
         Resolution|FIXED   |DUPLICATE

--- Comment #15 from Alexander Monakov ---
Sorry, didn't mean to remove the duplicate info. I could swear I didn't touch the dropdown, not sure what happened.

*** This bug has been marked as a duplicate of bug 57067 ***
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov changed:
           What    |Removed  |Added
         Resolution|DUPLICATE|FIXED

--- Comment #14 from Alexander Monakov ---
(In reply to Andrew Pinski from comment #13)
> The lifetime of the pseudo was already across the call ...

Hm, I disagree: 'vb = 1' is a killing definition. Therefore the 'vb = 0' initialization is dead at the point of the call.
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #12 from Alexander Monakov ---
Shouldn't there be another bug for the sched1 issue specifically? In the absence of abnormal control flow, extending lifetimes of pseudos across calls is still likely to be a pessimization.
[Bug tree-optimization/108129] New: nop_atomic_bit_test_and_p is too bloated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108129

Bug ID: 108129
Summary: nop_atomic_bit_test_and_p is too bloated
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

match.pd has the multi-pattern matcher 'nop_atomic_bit_test_and_p'. It expands to ~38 KLOC in gimple-match.cc and ~350 KB in the compiled binary. There has to be a better way than repeatedly emitting the match pattern for each member of {ATOMIC,SYNC}_FETCH_{AND,OR,XOR}_N :)
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

--- Comment #9 from Alexander Monakov ---
(In reply to Feng Xue from comment #8)
> In another angle, because gcc already model control flow and SSA web for
> setjmp/longjmp, explicit volatile specification is not really needed.

That covers GIMPLE, but after transitioning to RTL, setjmp is not properly modeled anymore (as in old versions of GCC before Tree-SSA). Many RTL passes simply refuse to touch the function if it has a setjmp call, but as your example demonstrated, scheduling can still make a surprising transform.
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov changed:
           What    |Removed |Added
             Status|RESOLVED|UNCONFIRMED
         Resolution|INVALID |---

--- Comment #5 from Alexander Monakov ---
On further thought, this is really an invalid transform, because the value becomes "clobbered" only if it was changed between setjmp and longjmp (C11 7.13.2.1 "The longjmp function"):

> All accessible objects have values, and all other components of the abstract
> machine have state, as of the time the longjmp function was called, except
> that the values of objects of automatic storage duration that are local to
> the function containing the invocation of the corresponding setjmp macro
> that do not have volatile-qualified type and have been changed between the
> setjmp invocation and longjmp call are indeterminate.

In the testcase, the assignment 'vb = 1' did not happen in the abstract machine.

Moving back to UNCONFIRMED, both because the transform is invalid, and because lifting assignments to pseudos across calls in sched1 seems useless if not harmful to performance and code size.

(That said, the -Wclobbered diagnostic still points to a potential issue, so it shouldn't be ignored.)
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov ---
-Wclobbered properly warns here (and it's part of -Wextra). With explicit -fschedule-insns, this is reproducible on x86 as well.

The reason for the issue is quite surprising, though: I did not expect pre-RA scheduling to lift assignments to pseudos across calls, because it just increases register pressure at the point of the call for little or no gain.
[Bug tree-optimization/108076] [10/11/12/13 Regression] GCC with -O3 produces code which fails to link
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108076

Alexander Monakov changed:
           What    |Removed                   |Added
                 CC|                          |amonakov at gcc dot gnu.org
            Summary|GCC with -O3 produces code|[10/11/12/13 Regression]
                   |which fails to link       |GCC with -O3 produces code
                   |                          |which fails to link
      Known to work|                          |8.5.0
          Component|c                         |tree-optimization
           Keywords|                          |link-failure

--- Comment #2 from Alexander Monakov ---
GIMPLE if-conversion seems to delete BBs with address-taken labels; works with -fno-tree-loop-if-convert.
[Bug tree-optimization/108008] [12 Regression] wrong code with -O3 and posix_memalign
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008 --- Comment #10 from Alexander Monakov --- Looks similar to PR 107323, but needs explicit -ftree-loop-distribution to trigger.
[Bug tree-optimization/108008] [12 Regression] wrong code with -O3 and posix_memalign
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008

--- Comment #9 from Alexander Monakov ---
I think this is tree-ldist placing memset(sameZ, 0, zPlaneCount) after the loop, overwriting the conditional 'sameZ[i] = true' assignments that happen in the loop.

For the smaller testcase from comment #6, -O2 -ftree-loop-distribution is enough, namely:

works:  gcc-12 -O2 -ftree-loop-distribution -fno-tree-vectorize -fno-tree-loop-distribute-patterns
breaks: gcc-12 -O2 -ftree-loop-distribution -fno-tree-vectorize
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #11 from Alexander Monakov ---
Factoring out the Lujiazui divider shrinks its tables by almost 20x:

     3 r lujiazui_decoder_min_issue_delay
    20 r lujiazui_decoder_transitions
    32 r lujiazui_agu_min_issue_delay
   126 r lujiazui_agu_transitions
   304 r lujiazui_div_base
   352 r lujiazui_div_check
   352 r lujiazui_div_transitions
  1152 r lujiazui_core_min_issue_delay
  1592 r lujiazui_agu_translate
  1592 r lujiazui_core_translate
  1592 r lujiazui_decoder_translate
  1592 r lujiazui_div_translate
  3952 r lujiazui_div_min_issue_delay
  9216 r lujiazui_core_transitions
[Bug c++/108008] Compiler mis-optimization with posix_memalign
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
-fno-split-loops "cures" it (of course, it might just be an enabling transform for an incorrect optimization later on). Bisecting trunk for which commit fixes/hides it may be useful.
[Bug c/107971] linking an assembler object creates an executable stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107971

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov ---
The warning is new in binutils-2.39 (the latest release at this time); perhaps your linker is older.
[Bug tree-optimization/107879] [13 Regression] ffmpeg-4 test suite fails on FPU arithmetics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879

--- Comment #10 from Alexander Monakov ---
If anyone is confused like I was: the commit actually includes a testcase, but the addition is not mentioned in the ChangeLog. I was sure the server-side receive hook was supposed to reject such an incomplete ChangeLog, though?
[Bug middle-end/107905] 2x slowdown versus CLANG and ICL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905 --- Comment #6 from Alexander Monakov --- Let me add that Clang supports GCC's -fprofile-{generate,use} flags for compatibility as well.
[Bug middle-end/107905] 2x slowdown versus CLANG and ICL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #5 from Alexander Monakov ---
Not sure what you don't like about the inputs, they appear quite reasonable. Perhaps GCC's estimation of bb frequencies is off (with profile feedback we achieve good performance).

Georgi: you'll likely see better results with profile-guided optimization. You can first compile the benchmark with -O2 -fprofile-generate, run the output (it will generate *.gcda files), then compile again with -O2 -fprofile-use. For Clang the options are spelled -fprofile-instr-generate and -fprofile-instr-use, respectively.
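Assuming a single-file benchmark (bench.c is a placeholder name), the workflow described above can be sketched as:

```shell
# GCC: instrument, train, then recompile using the profile (*.gcda files)
gcc -O2 -fprofile-generate bench.c -o bench
./bench
gcc -O2 -fprofile-use bench.c -o bench

# Clang: same idea, different spelling; raw profiles are merged first
clang -O2 -fprofile-instr-generate bench.c -o bench
LLVM_PROFILE_FILE=bench.profraw ./bench
llvm-profdata merge -output=bench.profdata bench.profraw
clang -O2 -fprofile-instr-use=bench.profdata bench.c -o bench
```

The training run should exercise the same hot paths as the real workload, otherwise the profile can mislead the optimizers.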
[Bug driver/107787] -Werror=array-bounds=X does not work as expected
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107787

Alexander Monakov changed:
           What    |Removed |Added
             Status|NEW     |RESOLVED
                 CC|        |amonakov at gcc dot gnu.org
         Resolution|---     |FIXED

--- Comment #3 from Alexander Monakov ---
Fixed for gcc-13.
[Bug middle-end/107905] 2x slowdown versus CLANG and ICL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

Alexander Monakov changed:
           What    |Removed |Added
           Keywords|ra      |
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov ---
LLVM does a better job at code layout and massively wins on the number of executed branches (in particular unconditional jumps). With -fdisable-rtl-bbro gcc achieves similar performance.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #26 from Alexander Monakov --- Sure, the right course of action seems to be to simply document that atomic types and built-ins are meant to be used on "common" (writeback) memory, and no guarantees can be given otherwise, because it would involve platform specifics (relaxed ordering of WC writes as you say; tearing by PCI bridges and device interfaces seems like another possible caveat).
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #24 from Alexander Monakov ---
(In reply to Peter Cordes from comment #23)
> But at least on Linux, I don't think there's a way for user-space to even
> ask for a page of WT or WP memory (or UC or WC). Only WB memory is easily
> available without hacking the kernel. As far as I know, this is true on
> other existing OSes.

I think it's possible to get UC/WC mappings via a graphics/compute API (e.g. OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device memory (and then the CPU vendor cannot guarantee that a 128b access won't tear, because it might depend on downstream devices).
[Bug rtl-optimization/107772] function prologue generated even though it's only needed in an unlikely path
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107772

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov ---
You'll get better results from outlining a rare path manually: the prologue/epilogue won't be re-executed for each invocation of 'g':

int g(int);

__attribute__((noinline,cold))
static void f_slowpath(int* b, int* e)
{
    switch (0) do {
        if (*b != 0)
    default:
            *b = g(*b);
    } while (++b != e);
}

void f(int* b, int* e)
{
    for (; b != e; b++)
        if (*b != 0) {
            f_slowpath(b, e);
            return;
        }
}
[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
>
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline. The latter goes through renaming as one fused uop.

> Misuse of load+op is far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> loop that can potentially run at 3 clocks per iteration into loop of 4+
> clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported similar issue in 97127, but
> here it is more serious. It looks to me that the issue is not soluble within
> existing gcc optimization framework. The only chance is if you accept my old
> and simple advice - within inner loops pretend that AVX is RISC, i.e.
> generate code as if load-op form of AVX instructions weren't existing.

In bug 97127 the best explanation we have so far is that we don't optimally handle the case where non-memory inputs of an fma are reused, so we can't combine a load with an fma without causing an extra register copy (PR 97127 comment 16 demonstrates what I mean). I cannot imagine such trouble arising with more common commutative operations like mul/add, especially with the non-destructive VEX encoding. If you hit such examples, I would suggest reporting them also, because their root cause might be different.

In general load-op combining should be very helpful on x86, because it reduces the number of uops flowing through the renaming stage, which is one of the narrowest points in the pipeline.
[Bug middle-end/107879] [13 Regression] ffmpeg-4 test suite fails on FPU arithmetics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
Yes, thanks for the report. OK with -fno-tree-dominator-opts. The dom2/dom3 passes duplicate most of the computations in build_filter for the 'x == 0' branch, but the phi node in the resulting basic block 5 incorrectly receives 0.0 (from bb 6) as the value of 'ffm' computed on the duplicated path (it should be 1.0).
[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #18 from Alexander Monakov ---
The apparent 'bias' is introduced by instruction scheduling: haifa-sched lifts a +64 increment over memory accesses, transforming +0 and +32 displacements to -64 and -32. Sometimes this helps a little bit even on modern x86 CPUs.

Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be 'unlaminated' (turned to 2 uops before renaming), so selecting independent IVs for the two arrays actually helps on this testcase.
[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647

--- Comment #15 from Alexander Monakov ---
I'm confused about the first hunk in the attached patch:

--- a/gcc/tree-vect-slp-patterns.cc
+++ b/gcc/tree-vect-slp-patterns.cc
@@ -1035,8 +1035,10 @@ complex_mul_pattern::matches (complex_operation_t op,
   auto_vec left_op, right_op;
   slp_tree add0 = NULL;

-  /* Check if we may be a multiply add.  */
+  /* Check if we may be a multiply add.  It's only valid to form FMAs
+     with -ffp-contract=fast.  */
   if (!mul0
+      && flag_fp_contract_mode != FP_CONTRACT_FAST
       && vect_match_expression_p (l0node[0], PLUS_EXPR))
     {
       auto vals = SLP_TREE_CHILDREN (l0node[0]);

Shouldn't it be '== FP_CONTRACT_FAST' rather than '!='? It seems we are checking that a match is found and that contracting across statement boundaries is allowed.
[Bug middle-end/107719] 14% regression on TSVC s3113 on znve4 compared to GCC 7.5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107719

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
As you say, the inner loop is the same, and it iterates 32000 times. Most likely it crosses an instruction fetch boundary differently; try -falign-loops=32.
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #10 from Alexander Monakov ---
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much. I
> still have a working Bulldozer machine and I can do some testing.
> I think in the Bulldozer case I was basing the latency/throughput on data
> in Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too. Thanks for the reminder! This solves the conundrum nicely:

AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined, separate int/fp dividers;
AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not pipelined (one divider), fp division is slightly pipelined (two independent dividers);
Zhaoxin Lujiazui appears to use the same divider as VIA Nano 3000, which is not pipelined.

So it's already enough to produce a decent patch.

> How do you test it?

For AMD Zen patches I was using measurements by Andreas Abel (https://uops.info/table_overview.html) and running a few experiments myself by coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.
[Bug tree-optimization/107715] TSVC s161 for double runs at zen4 30 times slower when vectorization is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715

--- Comment #3 from Alexander Monakov ---
There's a forward dependency over 'c' (the read of c[i] vs. the write of c[i+1], with 'i' iterating forward), and the vectorized variant takes the hit on each iteration. How is a slowdown even surprising?

For the non-vectorized variant you have at most 50% of iterations waiting on the previous one, when 'b' has positive and negative elements in alternation, but the generator doesn't elicit this worst case.
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #8 from Alexander Monakov ---
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> >
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
> Yes, I think that makes sense...

Do you mean we should fix the modeling of divisions there as well? I don't have latency/throughput measurements for those CPUs, nor access so I can run experiments myself, unfortunately. I guess you mean just making a patch to model the division units separately, leaving latency/throughput as in the current incorrect models, and leaving it to the manufacturers to correct it?

Alternatively, for AMD Bobcat and Bulldozer we might be able to crowd-source it eventually.
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #6 from Alexander Monakov ---
With these patches on trunk, the current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40

  2496 r slm_base
  2527 r bdver3_load_min_issue_delay
  2746 r glm_base
  3892 r bdver1_fp_base
       r bdver1_ieu_min_issue_delay
  4492 r geode_base
  4608 r bdver3_ieu_transitions
  6402 r bdver1_load_transitions
  6720 r znver1_fp_min_issue_delay
  7862 r athlon_fp_check
  7862 r athlon_fp_transitions
  9122 r lujiazui_core_base
  9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
 10108 r bdver3_load_transitions
 10498 r geode_check
 10498 r geode_transitions
 11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
 12575 r athlon_fp_min_issue_delay
 12742 r btver2_fp_check
 12742 r btver2_fp_transitions
 13896 r slm_check
 13896 r slm_transitions
 17149 t internal_min_issue_delay(int, DFA_chip*)
 17349 t internal_state_transition(int, DFA_chip*)
 17776 r bdver1_ieu_transitions
 20068 r bdver1_fp_check
 20068 r bdver1_fp_transitions
 26208 r slm_min_issue_delay
 27244 r bdver1_fp_min_issue_delay
 28518 r glm_check
 28518 r glm_transitions
 33690 r geode_min_issue_delay
 46980 r bdver3_fp_min_issue_delay
 49428 r glm_min_issue_delay
 53730 r btver2_fp_min_issue_delay
 53760 r znver1_fp_transitions
 93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in the lujiazui and b[dt]ver[123] models?
[Bug target/107676] Nonsensical docs for -mrelax-cmpxchg-loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107676

Alexander Monakov changed:
           What    |Removed |Added
             Status|NEW     |RESOLVED
                 CC|        |amonakov at gcc dot gnu.org
         Resolution|---     |FIXED

--- Comment #8 from Alexander Monakov ---
Fixed.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #15 from Alexander Monakov --- Ah, there will be an mfence after the vmovdqa when necessary for an atomic store, thanks (I missed that because the testcase doesn't scan for mfence).
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Alexander Monakov changed:
           What    |Removed |Added
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #13 from Alexander Monakov ---
Jakub, sorry if I misunderstood the patches from a brief glance, but what ordering guarantees are you assuming for AVX accesses? It should not be SEQ_CST. I think what the Intel manual is saying is that such accesses will not tear, but reordering follows the pre-existing x86 TSO rules (a load can finish before an earlier store is globally visible).
[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647 --- Comment #6 from Alexander Monakov --- Sure, but I was talking specifically about the pattern matching introduced by that commit.
[Bug tree-optimization/107647] [12/13 Regression] GCC 12.2.0 may produce FMAs even with -ffp-contract=off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- Nice catch, thanks for the report. This is due to g:7d810646d421 The documentation should clarify that patterns correspond to basic fma instructions (without intermediate rounding), and SLP pattern matching should check flag_fp_contract_mode != FP_CONTRACT_OFF.
[Bug other/107621] spinx generated documents has too much white space on the top
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107621 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #5 from Alexander Monakov --- The unnecessary empty space appears due to some subresources not loading as a result of a Content-Security-Policy issue: https://inbox.sourceware.org/gcc-patches/5ea2ef7e-4b89-272f-c8e1-f3874c9fa...@pfeifer.com/T/#m5acd422ef000b9758206cb186fe62d6244b8cd47
[Bug tree-optimization/107505] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107505 Alexander Monakov changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Alexander Monakov --- (In reply to Richard Biener from comment #2) > That looks about correct - patch is OK if testing succeeds. Thanks, fixed.
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832 --- Comment #3 from Alexander Monakov --- Followup patches have been posted at https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/
[Bug tree-optimization/107505] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107505 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Thanks. This is tree-ssa-sink relocating the call after 'zero' is discovered to be const, so I think the fix may be as simple as

diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 921305201..631fc88c3 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -266,11 +266,11 @@ statement_sink_location (gimple *stmt, basic_block frombb,
   /* We only can sink assignments and non-looping const/pure calls.  */
   int cf;
   if (!is_gimple_assign (stmt)
       && (!is_gimple_call (stmt)
	   || !((cf = gimple_call_flags (stmt)) & (ECF_CONST|ECF_PURE))
-	   || (cf & ECF_LOOPING_CONST_OR_PURE)))
+	   || (cf & (ECF_LOOPING_CONST_OR_PURE|ECF_RETURNS_TWICE))))
     return false;
 
   /* We only can sink stmts with a single definition.  */
   def_p = single_ssa_def_operand (stmt, SSA_OP_ALL_DEFS);
   if (def_p == NULL_DEF_OPERAND_P)
[Bug other/107353] frontends sometimes select wrong (too strong) TLS access model
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 Alexander Monakov changed: What|Removed |Added Summary|[13 regression] Numerous|frontends sometimes select |ICEs after |wrong (too strong) TLS |r13-3416-g1d561e1851c466|access model --- Comment #15 from Alexander Monakov --- C FE issue was broken out as PR 107419 and Fortran FE issue as PR 107421, which now "block" this PR together with PR 107393 for the earlier C++ testcase. The offending assert is gone, so retitling (not a regression anymore).
[Bug fortran/107421] New: problematic interaction of 'common' and 'threadprivate'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107421 Bug ID: 107421 Summary: problematic interaction of 'common' and 'threadprivate' Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: openmp Severity: normal Priority: P3 Component: fortran Assignee: unassigned at gcc dot gnu.org Reporter: amonakov at gcc dot gnu.org CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com, bergner at gcc dot gnu.org, iains at gcc dot gnu.org, law at gcc dot gnu.org, marxin at gcc dot gnu.org, segher at gcc dot gnu.org, seurer at gcc dot gnu.org, unassigned at gcc dot gnu.org Blocks: 107353 Target Milestone: --- +++ This bug was initially created as a clone of Bug #107353 +++

      integer :: i
      common /c/ i
!$omp threadprivate (/c/)
      i = 0
      end

f951 -fopenmp invokes decl_default_tls_model before assigning DECL_COMMON in fortran/trans-common.cc:build_common_decl. This causes 'c' to have local-exec model rather than initial-exec, breaking internal verification that was weakened to solve PR 107353. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 [Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
[Bug c/107419] New: attributes are ignored when selecting TLS model
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107419 Bug ID: 107419 Summary: attributes are ignored when selecting TLS model Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: amonakov at gcc dot gnu.org CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com, bergner at gcc dot gnu.org, iains at gcc dot gnu.org, law at gcc dot gnu.org, marxin at gcc dot gnu.org, segher at gcc dot gnu.org, seurer at gcc dot gnu.org, unassigned at gcc dot gnu.org Blocks: 107353 Target Milestone: --- +++ This bug was initially created as a clone of Bug #107353 +++

__attribute__((common)) __thread int i;
int *f() { return &i; }

C frontend invokes decl_default_tls_model before processing attributes, assigning local-exec model as if the 'common' attribute was not present. Recomputing it later would select initial-exec model, breaking internal verification that was weakened to solve PR 107353. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 [Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 --- Comment #13 from Alexander Monakov --- As for the Fortran testcases, the issue is again caused by the front-end invoking decl_default_tls_model before assigning DECL_COMMON, this time in fortran/trans-common.cc:build_common_decl. So I guess I can be happy that the assert uncovered issues in three front-ends, and adjust the code to avoid downgrading TLS model instead of asserting:

diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
index 3ed2b7cf6..bb86005e5 100644
--- a/gcc/ipa-visibility.cc
+++ b/gcc/ipa-visibility.cc
@@ -886,8 +886,8 @@ function_and_variable_visibility (bool whole_program)
	   && vnode->ref_list.referring.length ())
	 {
	   enum tls_model new_model = decl_default_tls_model (decl);
-	   gcc_checking_assert (new_model >= decl_tls_model (decl));
-	   set_decl_tls_model (decl, new_model);
+	   if (new_model >= decl_tls_model (decl))
+	     set_decl_tls_model (decl, new_model);
	 }
     }
 }
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 --- Comment #12 from Alexander Monakov --- ICE on the emutls-3.c testcase isn't related to emutls. Rather, the frontend invokes decl_default_tls_model before attributes are processed, so the first time around we miss the 'common' attribute when deciding the TLS access model. The following cut-down testcase fails on x86 as well with -m32 -fpie:

__attribute__((common)) __thread int i;
int *f() { return &i; }

Before the offending commit GCC compiled 'f' as if the attribute was ignored. (on ELF targets combining TLS and COMMON is problematic if not undefined)
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 --- Comment #11 from Alexander Monakov --- I've broken out the C++ issue from comment #10 as PR 107393, thanks for the testcase. It's a separate issue from emutls and Fortran ICEs on other targets.
[Bug c++/107393] New: Wrong TLS model for specialized template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107393 Bug ID: 107393 Summary: Wrong TLS model for specialized template Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: amonakov at gcc dot gnu.org CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com, bergner at gcc dot gnu.org, iains at gcc dot gnu.org, law at gcc dot gnu.org, marxin at gcc dot gnu.org, segher at gcc dot gnu.org, seurer at gcc dot gnu.org, unassigned at gcc dot gnu.org Blocks: 107353 Target Milestone: --- +++ This bug was initially created as a clone of Bug #107353 +++

template<class T> struct S { static __thread int i; };
template<class T> __thread int S<T>::i;
extern template __thread int S<void>::i;

int vi() { return S<void>::i; }
int ci() { return S<char>::i; }

Current trunk ICEs due to a new verification in ipa-visibility; before that gcc -O2 used to emit:

_Z2viv:
	movq	%fs:0, %rax
	addq	$_ZN1SIvE1iE@tpoff, %rax
	ret
_Z2civ:
	movq	%fs:0, %rax
	addq	$_ZN1SIcE1iE@tpoff, %rax
	ret
_ZN1SIcE1iE:
	.zero	4

which incorrectly uses the local-exec model to retrieve S<void>::i, which is extern (and thus could reside in a shared library at link time, not the executable being linked). Clang correctly uses initial-exec for S<void>::i and local-exec for S<char>::i. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 [Bug 107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 --- Comment #9 from Alexander Monakov --- Actually, latest results from H.J. Lu's periodic x86_64 tester don't exhibit such issues either: https://inbox.sourceware.org/gcc-testresults/20221025065901.6dc0062...@gnu-34.sc.intel.com/T/#u
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 --- Comment #8 from Alexander Monakov --- (In reply to Arseny Solokha from comment #7) > I have it on x86_64-pc-linux-gnu… Thanks for the info (I assume you don't have any special configure arguments), but that's surprising, I ran bootstrap+regtest before committing the patch, and did not see such issues. I'll recheck with today's trunk.
[Bug other/107353] [13 regression] Numerous ICEs after r13-3416-g1d561e1851c466
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107353 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #6 from Alexander Monakov --- I can start investigating the root cause later today. In the meantime, please supply the usual reproduction info if possible (configure arguments and preprocessed source where applicable). Presumably powerpc64le doesn't use emutls, so there might be two issues. FWIW, I don't understand why I was not Cc'ed on this bug, especially if adding the main author turned out to be a problem. The commit message gives my email twice, as a co-author and as the committer, and it's conveniently hyperlinked from comment 0.
[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832 --- Comment #1 from Alexander Monakov --- Suggested partial fix for the integer-pipe side of the blowup: https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae...@ispras.ru/
[Bug middle-end/102380] [meta-bug] visibility (fvisibility=* and attributes) issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102380 Bug 102380 depends on bug 99619, which changed state. Bug 99619 Summary: fails to infer local-dynamic TLS model from hidden visibility https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99619 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug middle-end/99619] fails to infer local-dynamic TLS model from hidden visibility
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99619 Alexander Monakov changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Alexander Monakov --- Fixed for gcc-13.
[Bug target/107250] Load unnecessarily happens before malloc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107250 --- Comment #3 from Alexander Monakov --- Well, obviously because in one function both 'f' and 'tmp' are live across the call, and in the other function only 'f' is live across the call. The difference is literally pushing one register vs. two registers, plus extra 8 bytes to preserve 16-byte ABI alignment.
[Bug tree-optimization/107250] Load unnecessarily happens before malloc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107250 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- On the other hand, dispatching the load before malloc is useful if you expect it to miss in the caches. If you wrote the code with that in mind, and the compiler moved the load anyway, a manual workaround to *that* would be more invasive.
[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115 --- Comment #12 from Alexander Monakov --- For reference, the previous whacked mole appears to be PR 106187 (where mems_same_for_tbaa_p comes from).
[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115 --- Comment #8 from Alexander Monakov --- Just optimizing out the redundant store seems difficult because on some targets scheduling is invoked from reorg (and it relies on alias sets). We need a solution that works for combine too — is it possible to invent a representation for a no-op in-place MEM "move" that only changes its alias set?
[Bug middle-end/107115] Wrong codegen from TBAA under stores that change effective type?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107115 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org, ||jakub at gcc dot gnu.org --- Comment #6 from Alexander Monakov --- Cc'ing Jakub for the problematic i386.md peephole that loses alias set info. As Andrew mentioned in comment #2, the next stop is combine, which sees

(set (mem:DI (plus:DI (mult:DI (sign_extend:DI (reg:SI 101))
                               (const_int 8 [0x8]))
                      (reg/v/f:DI 90 [ p4 ])) [1 MEM[(long int *)_11]+0 S8 A64])
     (mem:DI (plus:DI (mult:DI (sign_extend:DI (reg:SI 101))
                               (const_int 8 [0x8]))
                      (reg/v/f:DI 90 [ p4 ])) [2 *_11+0 S8 A64]))

as a no-op move and removes it (but note differing alias sets in the MEMs). And with -fdisable-rtl-combine it is then broken by peephole2 of all things:

;; Attempt to optimize away memory stores of values the memory already
;; has.  See PR79593.
(define_peephole2
  [(set (match_operand 0 "register_operand")
	(match_operand 1 "memory_operand"))
   (set (match_operand 2 "memory_operand")
	(match_dup 0))]
  "!MEM_VOLATILE_P (operands[1])
   && !MEM_VOLATILE_P (operands[2])
   && rtx_equal_p (operands[1], operands[2])
   && !reg_overlap_mentioned_p (operands[0], operands[2])"
  [(set (match_dup 0) (match_dup 1))])
[Bug tree-optimization/107107] [10/11/12/13 Regression] Wrong codegen from TBAA when stores to distinct same-mode types are collapsed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107107 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #6 from Alexander Monakov --- (In reply to Andrew Pinski from comment #5) > (In reply to Rich Felker from comment #1) > > There's also a potentially related test case at > > https://godbolt.org/z/jfv1Ge6v4 - I'm not yet clear on whether it's likely > > to have the same root cause. > > This might be a different issue I think. Yeah, that's sched2 reordering the accesses (probably cselib is confused). Needs a separate report.
[Bug tree-optimization/107099] New: uncprop a bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107099 Bug ID: 107099 Summary: uncprop a bit Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: amonakov at gcc dot gnu.org Target Milestone: --- For the following testcase

#include <immintrin.h>

__attribute__((target("avx")))
int f(__m128i a[], long n)
{
    for (long i = 0; i < n; i++)
        if (!_mm_testz_si128(a[i], a[i]))
            return 0;
    return 1;
}

gcc -O2 generates

f:
        test    rsi, rsi
        jle     .L4
        xor     eax, eax
        jmp     .L3
.L10:
        add     rax, 1
        cmp     rsi, rax
        je      .L4
.L3:
        mov     rdx, rax
        sal     rdx, 4
        vmovdqa xmm0, XMMWORD PTR [rdi+rdx]
        xor     edx, edx
        vptest  xmm0, xmm0
        sete    dl
        je      .L10
        mov     eax, edx
        ret
.L4:
        mov     edx, 1
        mov     eax, edx
        ret

Note the redundant assignments to edx in the loop and compare with gcc -O2 -fdisable-tree-uncprop1. Also note that generally uncprop adds a data dependency where only a control dependency existed, hurting speculative execution (hence it is more appropriate for -Os than -O2).
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #19 from Alexander Monakov --- (In reply to rguent...@suse.de from comment #18)
> True - but does that catch the cases people are interested and are
> allowed by the FP contraction rules? I'm thinking of
>
> x = a*b + c*d + e + f;
>
> with -fassociative-math we can form two FMAs here?

Yes; it might be reasonable to limit the match.pd rule to -fno-associative-math, leaving mul/adds as-is for tree-ssa-math-opts to recombine otherwise.

> Of course with
> strict IEEE compliance but allowed FP contraction we can only
> do FMA (a, b, c*d) + e + f, right?

I think so.

> Does that mean -ffp-contract=on
> only makes sense in absence of any other -ffast-math flags?

Well, the proposal was to make -ffp-contract=fast an '-ffast-math' flag, not =on. I don't want to judge if the '-ffp-contract=on -ffast-math' combination is reasonable or not, because -ffast-math by itself is quite nonsensical already.
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #17 from Alexander Monakov --- (In reply to Richard Biener from comment #16) > I do think that since the only way to > preserve expression boundaries is by PAREN_EXPR Yes, but... > that the middle-end > shouldn't care about FAST vs. ON (well, it cannot), but the language > frontends need to ensure to emit PAREN_EXPRs for =ON and omit them for > =FAST. this will also prevent reassociation across statements too. Doing FMA contraction in the frontends via a match.pd rule doesn't have this drawback.
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #15 from Alexander Monakov --- (In reply to Richard Biener from comment #14) > I can't > seem to reproduce any vectorization for your smaller example though. My small C samples omit some detail as they were meant to illustrate what happened in the IR. Is that a problem? By the way, I noticed that tree-ssa-math-opts incorrectly handles -ffp-contract: if (FLOAT_TYPE_P (type) && flag_fp_contract_mode == FP_CONTRACT_OFF) return false; It should be 'flag_fp_contract_mode != FP_CONTRACT_FAST' instead (the pass doesn't have any idea about expression boundaries). It dates back to g:1694907238eb
[Bug lto/107014] flatten+lto fails the kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014 --- Comment #7 from Alexander Monakov --- I wanted to understand what gets exposed in LTO mode that causes a blowup. I'd say flatten is not appropriate for this function (I don't think you want to force inlining of memset or _find_next_bit?), so might be better to go back to the original issue and solve the problem in a more focused way (e.g. force-inlining the function which needs to access __initdata if you really need the verification that triggers otherwise).
[Bug lto/107014] flatten+lto fails the kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014 --- Comment #5 from Alexander Monakov --- (In reply to Jiri Slaby from comment #4) > > I am surprised that "flatten" blows up on this function. Is that with any > > config, or again some specific settings like gcov? Is there an existing lkml > > thread about this? > > Yes, linked in the commit log: > https://lore.kernel.org/all/ > cak8p3a2zwfnexksm8k_suhhwkor17jfo3xaplxjzfpqx0eu...@mail.gmail.com/ I mean now, about compile time blowup with LTO.
[Bug lto/107014] flatten+lto fails the kernel build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107014 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- It was added to force inlining of small helpers that outgrow limits when building with gcov profiling: https://github.com/torvalds/linux/commit/258e0815e2b1706e87c0d874211097aa8a7aa52f (lack of inlining triggered a sanity check, as explained in the commit) I am surprised that "flatten" blows up on this function. Is that with any config, or again some specific settings like gcov? Is there an existing lkml thread about this?
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #13 from Alexander Monakov --- (In reply to Richard Biener from comment #12) > > Isn't it easy now to implement -ffp-contract=on by a GENERIC-only match.pd > > rule? > > You mean in the frontend only for -ffp-contract=on? Yes. > Maybe, I suppose FE > specific folding would also work in that case. One would also need to read > the fine prints in the language standards again as to whether FP contraction > allows to form FMA for > > double tem = a * b; > double res = tem + c; > > or across inlined function call boundaries which we'll happily do. In C contraction is allowed only within an expression (hence a difference between -ffp-contract=fast vs. -ffp-contract=on). The original testcase was in C++, I think C++ does not specify it, but hopefully we'd aim to implement the same semantics as for C. > Of course for the testcase at hand it's all in > a single statement and no parens specify association (in case parens also > matter here, like in Fortran). The fortran frontend adds PAREN_EXPRs > as association barriers which also would prevent FMAs to be formed. Please note that in this testcase GCC is breaking language semantics by computing the same value in two different ways, and then using different computed values in dependent computations. This could not have happened in the abstract machine (there's a singular assignment in the original program, which is then used in subsequent iterations of the loop).
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #11 from Alexander Monakov --- Can we move -ffp-contract=fast under the -ffast-math umbrella and default to -ffp-contract=on/off? Isn't it easy now to implement -ffp-contract=on by a GENERIC-only match.pd rule?
[Bug target/106952] Missed optimization: x < y ? x : y not lowered to minss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106952 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Note, your 'max' function is the same as 'min' (the issue remains with that corrected).
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 --- Comment #7 from Alexander Monakov --- Lawrence, thank you for the nice work reducing the testcase. For RawTherapee the recommended course of action would be to compile everything with -ffp-contract=off, then manually reintroduce use of fma in performance-sensitive places by testing the FP_FAST_FMA macro to know if hardware fma is available. This way you'll know that all systems without fma get the same results, and all systems with fma also get the same results (but different from the former). For example, my function 'f1' could be adapted like this:

void f1(void)
{
  double x1 = 0, x2 = 0, x3 = 0;
  for (int i = 0; i < 99; ) {
    double t;
#ifdef FP_FAST_FMA
    t = fma(x1, b1, fma(x2, b2, fma(x3, b3, B * one)));
#else
    t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
#endif
    printf("%d %g\t%a\n", i++, t, t);
    x3 = x2, x2 = x1, x1 = t;
  }
}
[Bug target/106902] [11/12/13 Regression] Program compiled with -O3 -mfma produces different result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #6 from Alexander Monakov --- This is a lovely showcase of how optimizations cooperatively produce something unexpected. TL;DR: SLP introduces redundant computations and then fma formation contracts some (but not all) of those, dramatically reducing numerical stability. In principle that's similar to incorrectly "optimizing"

double f(double x)
{
  double y = x * x;
  return y - y;
}

(which is guaranteed to return either NaN or 0) to

double f(double x)
{
  return fma(x, x, -(x * x));
}

which returns the round-off tail of x * x (or NaN). I think there's already another bug with a similar root cause. In this bug, we begin with (note, all following examples are supposed to be compiled without fma contraction, i.e. -O0, plain -O2, or -O2 -ffp-contract=off if your target has fma):

#include <stdio.h>
#include <math.h>

double one = 1;
double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;

void f1(void)
{
  double x1 = 0, x2 = 0, x3 = 0;
  for (int i = 0; i < 99; ) {
    double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
    printf("%d %g\t%a\n", i++, t, t);
    x3 = x2, x2 = x1, x1 = t;
  }
}

predcom unrolls by 3 to get rid of moves:

void f2(void)
{
  double x1 = 0, x2 = 0, x3 = 0;
  for (int i = 0; i < 99; ) {
    x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
    printf("%d %g\t%a\n", i++, x3, x3);
    x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
    printf("%d %g\t%a\n", i++, x2, x2);
    x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
    printf("%d %g\t%a\n", i++, x1, x1);
  }
}

SLP introduces some redundant vector computations:

typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
  double x1 = 0, x2 = 0, x3 = 0;
  f64v2 x32 = { 0 }, x21 = { 0 };
  for (int i = 0; i < 99; ) {
    x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;
    f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
    x32 = B * one + x13b1 + x21 * b2 + x32 * b3;
    x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
    f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
    x21 = B * one + x32 * b1 + x13b2 + x21 * b3;
    x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;
    printf("%d %g\t%a\n", i++, x32[0], x32[0]);
    printf("%d %g\t%a\n", i++, x32[1], x32[1]);
    printf("%d %g\t%a\n", i++, x21[1], x21[1]);
  }
}

Note that this is still bit-identical to the initial function. But then tree-ssa-math-opts "randomly" forms some FMAs:

f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
  return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
  f64v2 vone = { one, one }, vB = { B, B };
  f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };
  double x1 = 0, x2 = 0, x3 = 0;
  f64v2 x32 = { 0 }, x21 = { 0 };
  for (int i = 0; i < 99; ) {
    x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));
    f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
    x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));
    x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));
    f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
    x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));
    x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));
    printf("%d %g\t%a\n", i++, x32[0], x32[0]);
    printf("%d %g\t%a\n", i++, x32[1], x32[1]);
    printf("%d %g\t%a\n", i++, x21[1], x21[1]);
  }
}

and here some of the redundantly computed values are computed differently depending on where rounding after multiplication was omitted. Somehow this is enough to make the computation explode numerically.
[Bug lto/91299] [10/11/12/13 Regression] LTO inlines a weak definition in presence of a non-weak definition from an ELF file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91299 Alexander Monakov changed: What|Removed |Added Keywords||wrong-code Summary|LTO inlines a weak |[10/11/12/13 Regression] |definition in presence of a |LTO inlines a weak |non-weak definition from an |definition in presence of a |ELF file|non-weak definition from an ||ELF file --- Comment #14 from Alexander Monakov --- gcc-4.9 used to get this right, so let's play the regression card? This should not be in WAITING.
[Bug target/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834 --- Comment #10 from Alexander Monakov --- Okay, so this should have been reported against Binutils, but since we are having the conversation here: the current behavior is not good, gas is silently selecting a different relocation kind for no clear reason. Why is it not a warning or an error? Note that if you assemble such a GOT reference via NASM:

extern _GLOBAL_OFFSET_TABLE_
default rel
f:
	mov rax, [_GLOBAL_OFFSET_TABLE_ wrt ..gotpc]
	ret

then t.o has

<f>:
   0:	48 8b 05 00 00 00 00	mov    0x0(%rip),%rax        # 7
			3: R_X86_64_GOTPCREL	_GLOBAL_OFFSET_TABLE_-0x4
   7:	c3                   	ret

and ld -shared --no-relax -o t.so t.o does not reject it and t.so has

1000 <f>:
    1000:	48 8b 05 f1 1f 00 00	mov    0x1ff1(%rip),%rax        # 2ff8 <_DYNAMIC+0xe0>
    1007:	c3                   	ret

and without --no-relax:

1000 <f>:
    1000:	48 8d 05 f9 1f 00 00	lea    0x1ff9(%rip),%rax        # 3000 <_GLOBAL_OFFSET_TABLE_>
    1007:	c3                   	ret

So I don't see the reason why it's special-cased in gas.
[Bug target/106453] Redundant zero extension after crc32q
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106453 Alexander Monakov changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Alexander Monakov --- Fixed for gcc-13.
[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834 --- Comment #8 from Alexander Monakov --- Right, sorry, due to the presence of 'main' I overlooked -fPIC in comment #0, and then after my prompt it got dropped in comment #3. If you modify the testcase as follows and compile it with -fPIC, it's evident that GCC is treating both external symbols the same, but gas does not. Similar to PR 106835, it seems Binutils is special-casing by symbol name. But here the situation is worse, because GCC output is mentioning the intended relocation kind:

	movq	_GLOBAL_OFFSET_TABLE_@GOTPCREL(%rip), %rax

so silently using R_X86_64_GOTOFF64 instead doesn't look right.

#include <stdio.h>

extern char _GLOBAL_OFFSET_TABLE_[];
extern char xGLOBAL_OFFSET_TABLE_[];

int main()
{
    printf("%lx", (unsigned long)_GLOBAL_OFFSET_TABLE_);
    printf("%lx", (unsigned long)xGLOBAL_OFFSET_TABLE_);
}
[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834 --- Comment #6 from Alexander Monakov --- (In reply to Martin Liška from comment #5)
> Do you mean gas or ld?

gas

> How did you get this output, please (from foo.o or final executable)?

From foo.o like in comment #0.
[Bug c/106835] [i386] Taking an address of _GLOBAL_OFFSET_TABLE_ produces a wrong value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106835 --- Comment #3 from Alexander Monakov --- It would be unfortunate if that makes it difficult or even impossible to make a R_386_32 relocation for the address of GOT in hand-written assembly. In any case, it seems GCC is not making the rules here, so this should be reported against Binutils so they can clarify the situation?
[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834 Alexander Monakov changed: What|Removed |Added CC||hjl.tools at gmail dot com --- Comment #4 from Alexander Monakov --- Probably a Binutils bug then; with binutils-2.37 I get the correct

   4:	48 8d 05 00 00 00 00	lea    0x0(%rip),%rax        # b
			7: R_X86_64_GOTPC32	_GLOBAL_OFFSET_TABLE_-0x4

Can you please report it against binutils at https://sourceware.org/bugzilla/ and mention the link here?
[Bug c/106835] [i386] Taking an address of _GLOBAL_OFFSET_TABLE_ produces a wrong value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106835

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
Surely this is a Binutils (assembler) bug? gcc emits

ptr:
        .long   _GLOBAL_OFFSET_TABLE_
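For reference, the C shape behind that emission (a minimal sketch of what the report's testcase presumably looks like, not its verbatim contents) is simply taking the address of the linker-defined symbol:

```c
/* Minimal sketch (assumed shape of the report's testcase, not verbatim):
   take the address of the linker-defined _GLOBAL_OFFSET_TABLE_ symbol.
   GCC emits a plain data reference ("ptr: .long _GLOBAL_OFFSET_TABLE_" on
   i386); it is then the assembler that decides which relocation to use.  */
extern char _GLOBAL_OFFSET_TABLE_[];

char *ptr = _GLOBAL_OFFSET_TABLE_;
```

On a toolchain not exhibiting the bug, ptr holds the runtime address of the GOT after linking.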
[Bug c++/106834] GCC creates R_X86_64_GOTOFF64 for 4-bytes immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106834

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov ---
Can you show how gcc -S output looks for you on this testcase? For me the
problematic instruction is just

        movl    $_GLOBAL_OFFSET_TABLE_, %eax

or

        leaq    _GLOBAL_OFFSET_TABLE_(%rip), %rax

with -fpie, so it's the assembler that chooses the relocation type (which
would make this a Binutils bug).
[Bug middle-end/106804] Poor codegen for selecting and incrementing value behind a reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106804

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov ---
(In reply to Richard Biener from comment #7)
> In fact I'd say the reverse transformation is more profitable?

In the end it depends on the context. It's a trade-off between a conditional
branch and extra data dependencies feeding into the address of a store. If the
branch is perfectly predictable, it's preferable. Otherwise, if there's no
memory dependency via the store, you don't care about delaying it, making the
branchless version preferable if that reduces pipeline flushes. If there is a
dependency, it comes down to how often the branch mispredicts, I guess.

  /\
 |  People who tinker with compilers  |
 |  need __builtin_branchless_select  |
  \/
      \
       \
           .--.
          |o_o |
          | ~  |
         //   \ \
        (|     | )
       /'\_   _/`\
       \___)=(___/
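For illustration, the branchy and branchless forms of a select can be written out by hand (a minimal sketch; the function names are made up, and __builtin_branchless_select is a wish, not an existing GCC builtin):

```c
/* Branchy select: typically compiled to a conditional branch or a cmov,
   at the compiler's discretion.  */
int select_branchy(int c, int a, int b)
{
    return c ? a : b;
}

/* Branchless select: build an all-ones/all-zeros mask from the condition
   and blend the two values with bitwise operations, so no branch (and no
   misprediction) is possible, at the cost of extra data dependencies.  */
int select_branchless(int c, int a, int b)
{
    int mask = -(c != 0);            /* -1 when c is true, 0 otherwise */
    return (a & mask) | (b & ~mask);
}
```

Both return a when c is nonzero and b otherwise; which one wins depends on branch predictability, exactly the trade-off described above.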
[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2) since r13-1754-g7a158a5776f5ca95
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

--- Comment #5 from Alexander Monakov ---
GCC discovers that 'bar' is noreturn and tries to remove its LHS, but
unfortunately cgraph.cc:cgraph_edge::redirect_call_stmt_to_callee wants to
emit an assignment of the SSA default-def to the LHS. fixup_noreturn_call
seems to handle that in a smarter way. Is it possible to simply let
fixup_noreturn_call do its thing?

diff --git a/gcc/cgraph.cc b/gcc/cgraph.cc
index 8d6ed38ef..6597de669 100644
--- a/gcc/cgraph.cc
+++ b/gcc/cgraph.cc
@@ -1567,7 +1567,7 @@ cgraph_edge::redirect_call_stmt_to_callee (cgraph_edge *e)

       /* If the call becomes noreturn, remove the LHS if possible.  */
       tree lhs = gimple_call_lhs (new_stmt);
-      if (lhs
+      if (0 && lhs
          && gimple_call_noreturn_p (new_stmt)
          && (VOID_TYPE_P (TREE_TYPE (gimple_call_fntype (new_stmt)))
              || should_remove_lhs_p (lhs)))
[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2) since r13-1754-g7a158a5776f5ca95
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

--- Comment #4 from Alexander Monakov ---
(In reply to Martin Liška from comment #3)
> > Also ICEs in ipa-modref when 'noclone' added to 'noinline', a 12/13
> > regression (different cause, needs a separate PR).
>
> Can't reproduce Alexander, please attach a testcase.

Ah, it ICEs when emitting a dump, so -fdump-tree-modref2 is needed in addition
to -O2. I've filed that as PR 106783.
[Bug ipa/106783] New: [12/13 Regression] ICE in ipa-modref.cc:analyze_function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

            Bug ID: 106783
           Summary: [12/13 Regression] ICE in
                    ipa-modref.cc:analyze_function
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: ice-on-valid-code
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amonakov at gcc dot gnu.org
                CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
                    marxin at gcc dot gnu.org, unassigned at gcc dot gnu.org
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #106781 +++

ICEs when emitting a tree dump with -O2 -fdump-tree-modref2:

int n;

__attribute__ ((noinline,noclone,returns_twice))
static int bar (int)
{
  n /= 0;
  return n;
}

int
foo (int x)
{
  return bar (x);
}

t.c: In function ‘foo’:
t.c:12:1: internal compiler error: in analyze_function, at ipa-modref.cc:3286
   12 | foo (int x)
      | ^~~
0x10e548e analyze_function
        gcc/ipa-modref.cc:3286
0x10e83b5 execute
        gcc/ipa-modref.cc:4186

Note that -fdump-tree-modref2 is needed. It reaches a gcc_unreachable(); I'd
suggest moving the verification outside of dumping if possible, so that
whether the compiler ICEs does not depend on whether dumping was requested.
[Bug tree-optimization/106781] [13 Regression] ICE: verify_flow_info failed (error: returns_twice call is not first in basic block 2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106781

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
Thanks. It also ICEs in ipa-modref when 'noclone' is added to 'noinline', a
12/13 regression (different cause, needs a separate PR).
[Bug middle-end/106688] New: leaving SSA emits assignment into the inner loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106688

            Bug ID: 106688
           Summary: leaving SSA emits assignment into the inner loop
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

For the following testcase, gcc -O2

unsigned foo(const unsigned char *buf, long size);

unsigned bar(const unsigned char *buf, long size)
{
  typedef char i8v8 __attribute__((vector_size(8)));
  typedef short i16v8 __attribute__((vector_size(16)));
  long chunk_sz = 15*16;
  for (; size >= chunk_sz; size -= chunk_sz) {
    i16v8 vs1 = { 0 };
    const unsigned char *end = buf + chunk_sz;
    for (; buf != end; buf += 16) {
      i16v8 b;
      asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)buf));
      vs1 += b;
      asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)(buf+8)));
      vs1 += b;
    }
    asm("" :: "x"(vs1));
  }
  return foo(buf, size);
}

(asms needed due to PR 31667) generates

bar:
        cmp     rsi, 239
        jle     .L2
        lea     rdx, [rdi+240]
.L4:
        lea     rax, [rdx-240]
        pxor    xmm0, xmm0
.L3:
        pmovzxbw QWORD PTR [rax], xmm1
        add     rax, 16
        paddw   xmm0, xmm1
        mov     rdi, rdx        ; <<< ehhh
        pmovzxbw QWORD PTR [rax-8], xmm1
        paddw   xmm0, xmm1
        cmp     rax, rdx
        jne     .L3
        sub     rsi, 240
        add     rdx, 240
        cmp     rsi, 239
        jg      .L4
.L2:
        jmp     foo

It looks as if going out of SSA places in the loop a register copy
corresponding to a phi node which is outside of the loop. Strangely, RTL
optimizations do not clean it up either.
[Bug rtl-optimization/106553] pre-register allocation scheduler is now RMW aware
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov ---
Are you sure the testcase is correctly reduced, i.e. does it show the same
performance degradation?

Latency-wise the scheduler is making the correct decision here: we really want
to schedule the second-to-last FMA

  y = v_fma_f32 (y, r2, x);

earlier than its predecessor

  r = v_fma_f32 (y, r2, z);

because we need to compute y*r2 before the last FMA.
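The dependence pattern being described can be sketched as follows (v_fma_f32 is from the reporter's testcase; the stand-in below uses a plain multiply-add, since single-rounding FMA semantics do not matter for the scheduling argument, and the surrounding code is an assumed shape, not the reporter's actual source):

```c
/* Stand-in for the reporter's v_fma_f32: a*b + c.  */
static double v_fma(double a, double b, double c)
{
    return a * b + c;
}

/* Sketch of the dependence pattern: the last FMA consumes the y produced
   by the second statement, so scheduling that statement above the
   (independent) first one shortens the critical path through y.  */
double fma_chain(double y, double r2, double x, double z)
{
    double r = v_fma(y, r2, z);  /* uses the incoming y; result independent */
    y = v_fma(y, r2, x);         /* the scheduler wants this FMA early ...  */
    return v_fma(y, r2, r);      /* ... because it feeds y*r2 into this one */
}
```

With exact small inputs, fma_chain(1, 2, 3, 4) computes r = 6, y = 5 and returns 16; the point is the third call cannot issue until the second completes.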
[Bug middle-end/106470] Subscribed access to __m256i casted to (uint16_t *) produces garbage or a warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106470

--- Comment #8 from Alexander Monakov ---
But that's the point of many warnings, isn't it? To help the user understand
what's wrong when the code is bad? And bogus warnings just confuse more.
[Bug middle-end/106470] Subscribed access to __m256i casted to (uint16_t *) produces garbage or a warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106470

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov ---
Andrew, surely the bogus -Wuninitialized warning is a GCC bug here?