[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #7 from CVS Commits ---
The master branch has been updated by Vladimir Makarov:

https://gcc.gnu.org/g:a99f6bb142bc4506dcb8aa2b7722310ad92e4528

commit r14-5294-ga99f6bb142bc4506dcb8aa2b7722310ad92e4528
Author: Vladimir N. Makarov
Date:   Thu Nov 9 08:51:15 2023 -0500

    [IRA]: Fixing conflict calculation from region landing pads.

    The following patch fixes conflict calculation from exception landing
    pads.  The previous patch processed only one newly created landing pad.
    Besides it was wrong, it also resulted in large memory consumption by IRA.

    gcc/ChangeLog:

            PR rtl-optimization/110215
            * ira-lives.cc: (add_conflict_from_region_landing_pads): New
            function.
            (process_bb_node_lives): Use it.
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #6 from Hongyu Wang ---
Thanks for the fix. For the attached test, the main loop no longer contains any load.

There is a remaining issue: the loop epilogue still contains loads from the stack and the constant pool.

.L9:
        movslq  %edx, %rax
        movss   72(%rsp), %xmm5
        salq    $2, %rax
        leaq    (%rbx,%rax), %rcx
        movaps  %xmm5, %xmm1
        subss   (%rcx), %xmm1
        andps   .LC4(%rip), %xmm1
        movss   %xmm1, (%rcx)
        leal    1(%rdx), %ecx
        addss   %xmm1, %xmm0
        cmpl    %ecx, %r12d
        jle     .L8

The IRA dump shows that the pseudos have no conflicts, yet they still fail to be allocated to a register. This issue does not exist on aarch64.
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #5 from CVS Commits ---
The master branch has been updated by Vladimir Makarov:

https://gcc.gnu.org/g:154c69039571c66b3a6d16ecfa9e6ff22942f59f

commit r14-1891-g154c69039571c66b3a6d16ecfa9e6ff22942f59f
Author: Vladimir N. Makarov
Date:   Fri Jun 16 11:12:32 2023 -0400

    RA: Ignore conflicts for some pseudos from insns throwing a final exception

    IRA adds conflicts to the pseudos from insns can throw exceptions
    internally even if the exception code is final for the function and
    the pseudo value is not used in the exception code.  This results in
    spilling a pseudo in a loop (see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215).

    The following patch fixes the problem.

            PR rtl-optimization/110215

    gcc/ChangeLog:

            * ira-lives.cc: Include except.h.
            (process_bb_node_lives): Ignore conflicts from cleanup
            exceptions when the pseudo does not live at the exception
            landing pad.
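The rule described in the r14-1891 commit message can be pictured with a small toy model. This is only a sketch of the rule as worded in that message, not the code in ira-lives.cc: the `ThrowingInsn` struct, its fields, and both functions are hypothetical names, pseudos are plain integers, and the facts IRA would derive from RTL and EH data are simply passed in.

```cpp
#include <set>

// Toy model, NOT IRA code. All names here are hypothetical.
struct ThrowingInsn {
    bool can_throw_internal;            // insn may throw to a landing pad in this function
    bool landing_pad_is_cleanup;        // assumption: the exception code is final for the function
    std::set<int> live_at_landing_pad;  // pseudos live on entry to that landing pad
};

// Behaviour the commit message describes before the patch:
// any internally throwing insn adds the conflict, whatever the pseudo.
bool adds_conflict_before(const ThrowingInsn& insn, int /*pseudo*/) {
    return insn.can_throw_internal;
}

// Behaviour after the patch, as worded in the commit message:
// skip the conflict for a cleanup landing pad where the pseudo is not live.
bool adds_conflict_after(const ThrowingInsn& insn, int pseudo) {
    if (!insn.can_throw_internal)
        return false;
    if (insn.landing_pad_is_cleanup && insn.live_at_landing_pad.count(pseudo) == 0)
        return false;
    return true;
}
```

In this model a loop-invariant pseudo that the cleanup code never reads no longer picks up a conflict from the throwing call, which is the situation the bug title describes.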
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #4 from Vladimir Makarov ---
(In reply to Richard Biener from comment #3)
> We don't have any pass after reload that would perform loop invariant motion,
> I'm not sure how this situation is handled in general in RA - is a post-RA
> pass optimizing the spill/reload placement "globally" usually done?

LRA does not place reload insns. Global RA is supposed to do this when it forms regions for the allocation.

I've been working on this issue. I hope the fix will be ready this week.
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|EH                          |

--- Comment #3 from Richard Biener ---
The issue is that we fail to sink

  d_29 = {t_28, t_28, t_28, t_28};

We compute a good place in select_best_block, but since it is at the same loop depth as the original place we apply

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparison is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

and fail to sink. I'm not exactly sure why we do the above - we probably should when best_bb post-dominates early_bb, and also if the sunk stmt possibly (or provably) will enlarge the lifetime of its uses (but that's also hard to guess since we process sinking of the defs of the uses only afterwards). In this case we have a single use and a single def, so sinking shouldn't make things worse. We could also weigh in the spilling class of a reg here.

In our case the dominated block has a higher(!) count than the dominating block, which means the profile is corrupt.

With --param sink-frequency-threshold we sink the ctor and the feeding division but still get

.L5:
        movq    (%rbx), %rax
        pxor    %xmm1, %xmm1
        leaq    0(%rbp,%rax), %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        movaps  (%rsp), %xmm0
        addps   (%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, -16(%rax)
        addps   %xmm0, %xmm1
        cmpq    %rax, %rdx
        jne     .L4
        movaps  %xmm1, %xmm0
        movhlps %xmm1, %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        shufps  $85, %xmm1, %xmm0
        addps   %xmm1, %xmm0
.LEHB1:
        call    _Z1gf
        addq    $8, %rbx
        cmpq    %rbx, %r12
        jne     .L5

because we (rightfully so) refuse to sink into the outer loop. What we fail to do is hoist the reload out of the inner loop (I suppose clang does exactly that).

We don't have any pass after reload that would perform loop invariant motion. I'm not sure how this situation is handled in general in RA - is a post-RA pass optimizing the spill/reload placement "globally" usually done?
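To make the same-depth check from comment #3 concrete, here is a minimal sketch that replays its arithmetic with plain integers. It is not GCC code: `Block` and `same_depth_check_allows_sink` are hypothetical names, the counts are assumed to be known exact integers (so the "unknown comparison" case of GCC's profile counts cannot arise), and 75 in the note below is just an example threshold value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a basic block: only the two properties the
// quoted check reads. This is not GCC's basic_block type.
struct Block {
    int loop_depth;       // loop nesting level
    std::uint64_t count;  // execution count from the profile (assumed known)
};

// Returns true when the quoted same-depth test would pick `best`,
// i.e. when the statement would be sunk. Blocks at different depths are
// handled by other parts of select_best_block and are not modeled here.
bool same_depth_check_allows_sink(const Block& early, const Block& best,
                                  int threshold_percent) {
    assert(threshold_percent >= 0);
    if (best.loop_depth != early.loop_depth)
        return false;  // this particular check does not apply
    const auto threshold = static_cast<std::uint64_t>(threshold_percent);
    // Same shape as the quoted condition: !(best * 100 >= early * threshold).
    return !(best.count * 100 >= early.count * threshold);
}
```

With an early block count of 1000 and a threshold of 75, a candidate block with count 500 passes the check, while a candidate with count 1200 does not. The second case corresponds to the corrupt profile described above, where the dominated block has a higher count than the dominating one.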
[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
            Summary|RA fails to allocate        |RA fails to allocate
                   |register when loop          |register when loop
                   |invariant lives across      |invariant lives across
                   |calls                       |calls and eh
     Ever confirmed|0                           |1
           Keywords|                            |ra
   Last reconfirmed|                            |2023-06-12

--- Comment #2 from Andrew Pinski ---
Reduced testcase for both x86_64 and aarch64:
```
#define vec __attribute__((vector_size(4*sizeof(float))))
struct s1
{
  s1();
  ~s1();
};
void g();
void g(float);
void f(float a, float b, vec float **c, int n, int j)
{
  s1 t2;
  float t = a/b;
  vec float d = {t, t, t, t};
  for (int l = 0; l < j; l++)
  {
    vec float s = {};
    for(int i =0;i