https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124137

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> On the reduced testcase it certainly is r16-3290 though.

That makes sense; it introduces this code.

It seems to me that we should be able to compute safe TLS insertion BBs
upfront, in particular marking any block that has a live caller-saved
register at its end.

Then iterating to the immediate dominator recursively until an OK block
is found should be faster.
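
A toy sketch of the proposed scheme (this is not GCC code; block indices,
the idom array and the precomputed "safe" flags are all stand-ins for the
real basic_block / dominator-tree APIs):

```cpp
#include <cassert>
#include <vector>

// Toy sketch (not GCC code): blocks are indexed 0..n-1, idom[b] is the
// immediate dominator of b (with idom[entry] == entry), and safe[b] is
// the precomputed "a TLS call may be inserted at the end of b" flag,
// i.e. no caller-saved register is live across the end of b.
int find_safe_insertion_bb (const std::vector<int> &idom,
                            const std::vector<bool> &safe, int bb)
{
  // Walk up the immediate-dominator tree until a safe block is
  // reached; the entry block is assumed safe, so this terminates.
  while (!safe[bb])
    bb = idom[bb];
  return bb;
}
```

With the safe set computed once upfront, each query is a plain walk up
the dominator tree instead of a per-call liveness rescan.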

I'll note the code relies on REG_DEAD notes being present for incoming
register arguments moved to pseudos, but this doesn't seem to always
ensure insertion in BB 2 is possible.

In particular, the code seems to ignore REG_UNUSED notes for FLAGS_REG,
but note_stores will happily report FLAGS_REG as set when it is merely
clobbered, as in

(insn 807 805 6 2 (parallel [
            (set (subreg:SI (reg:HI 509) 0)
                (lshiftrt:SI (reg:SI 514)
                    (const_int 16 [0x10])))
            (clobber (reg:CC 17 flags))
        ])
"/home/packages/tmp/onednn-3.9.1+ds/src/cpu/x64/brgemm/jit_brgemm_amx_uker.cpp":1891:25 1213 {*lshrsi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SI 514)
            (nil))))

Either clobbers should not mark FLAGS_REG live, or REG_UNUSED should be
honored.  The latter sounds more conservative to me, with my limited knowledge.
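
A toy sketch of the latter option (not GCC code; the sets of written and
REG_UNUSED registers stand in for the real note_stores / REG_NOTES scan):

```cpp
#include <cassert>
#include <set>

// Toy sketch (not GCC code): note_stores reports every register an insn
// writes, including plain clobbers.  A written register should only be
// considered live after the insn if it carries no REG_UNUSED note --
// honoring REG_UNUSED keeps a flags-clobbering insn like the *lshrsi3_1
// above from extending the liveness of FLAGS_REG.
bool live_after_insn (const std::set<int> &written_regs,
                      const std::set<int> &unused_regs, int reg)
{
  return written_regs.count (reg) != 0 && unused_regs.count (reg) == 0;
}
```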

Then, without implementing the actual caching, removing the attempt to
optimize the CFG walk fixes this bug:

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index d5435f009cb..4baa6d3d064 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -4050,7 +4050,7 @@ ix86_emit_tls_call (rtx tls_set, x86_cse_kind kind, basic_block bb,
              /* Place the call before all FLAGS_REG setting BBs since
                 we can't place a call before nor after a conditional
                 jump.  */
-             bb = ix86_get_dominator_for_reg (FLAGS_REG, bb);
+             bb = get_immediate_dominator (CDI_DOMINATORS, bb); // ix86_get_dominator_for_reg (FLAGS_REG, bb);

              /* Start over again.  */
              repeat = true;
@@ -4080,7 +4080,9 @@ ix86_emit_tls_call (rtx tls_set, x86_cse_kind kind, basic_block bb,

          rtx link;
          for (link = REG_NOTES (insn); link; link = XEXP (link, 1))
-           if (REG_NOTE_KIND (link) == REG_DEAD
+           if ((REG_NOTE_KIND (link) == REG_DEAD
+                || (REG_NOTE_KIND (link) == REG_UNUSED
+                    && REGNO (XEXP (link, 0)) == FLAGS_REG))
                && REG_P (XEXP (link, 0)))
              {
                /* Mark the live caller-saved register as dead.  */
@@ -4105,6 +4107,9 @@ ix86_emit_tls_call (rtx tls_set, x86_cse_kind kind, basic_block bb,

       gcc_assert (!bitmap_empty_p (live_caller_saved_regs));

+      bb = get_immediate_dominator (CDI_DOMINATORS, bb);
+
+#if 0
       /* If any live caller-saved registers aren't dead at the end of
         this basic block, get the basic block which dominates all
         basic blocks which set the remaining live registers.  */
@@ -4117,6 +4122,7 @@ ix86_emit_tls_call (rtx tls_set, x86_cse_kind kind, basic_block bb,
          bitmap_set_bit (set_bbs, set_bb->index);
        }
       bb = nearest_common_dominator_for_set (CDI_DOMINATORS, set_bbs);
+#endif
     }
   while (true);
 }


I'm going to test this.