Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

Richard Sandiford Tue, 10 Mar 2020 04:50:04 -0700

Tamar Christina <[email protected]> writes:
> Hi All,
>
> This works around an ICE in reload where from expand we get the following RTL
> generated for VSTRUCT mode writes:
>
> (insn 446 354 445 2 (set (reg:CI 383)
>  (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci}
>  (nil))
>
> This sequence is trying to say two things:
>
> 1) liveliness: It's trying to say that eventually the whole CI reg will be
>              written to. It does this by generating the paradoxical subreg.
> 2) write data: It's trying to in the same instruction also write the V4SI mode
>              component at offset 0 in the CI reg.
>
> Reload is unable to understand this concept and so it attempts to handle this
> instruction by breaking apart the instruction, first writing the data and then
> tries to reload the paradoxical part.  This gets it to the same instruction
> again and eventually we ICE since we reach the limit of no. reloads.


reload/LRA does understand the concept.  It just isn't handling this
particular case very well :-)

The pre-RA insn is:

(insn 210 218 209 6 (set (reg/v:OI 182 [ vres ])
        (subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi}
     (expr_list:REG_DEAD (reg:V4SI 289)
        (nil)))

IRA allocates a hard register to r182 but not r289:

  126:r182 l0    32  ...
  ...
  129:r289 l0   mem  ...

This is because memory appears to be cheaper:

  a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 
MEM:110,110

That's probably a bug in itself (possibly in the target costs), but it
shouldn't be a correctness issue.

LRA then handles insn 210 like this:

      Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317
      Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318
  210: r182:OI=r318:OI
      REG_DEAD r289:V4SI
    Inserting slow/invalid mem reload before:
  355: r317:V4SI=[sfp:DI+0x60]
  356: r318:OI=r317:V4SI#0

            1 Non pseudo reload: reject++
          alt=0,overall=1,losers=0,rld_nregs=0
         Choosing alt 0 in insn 210:  (0) =w  (1) w {*aarch64_movoi}
      Change to class FP_REGS for r318

That looks OK so far, given the allocation.  But later we have:

********** Assignment #2: **********

         Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, 
tfreq=44)...
           Assign 32 to subreg reload r318 (freq=44)
         Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, 
tfreq=44)...
           Assign 0 to subreg reload r317 (freq=44)

So LRA is assigning a GPR (x0) to the new V4SI register r317.  It's this
allocation that induces the cycling, because we get:

           Cycle danger: overall += LRA_MAX_REJECT
          alt=0,overall=606,losers=1,rld_nregs=2
            0 Spill pseudo into memory: reject+=3
            Using memory insn operand 0: reject+=3
            0 Non input pseudo reload: reject++
            Cycle danger: overall += LRA_MAX_REJECT
          alt=1,overall=619,losers=2,rld_nregs=2
            1 Spill pseudo into memory: reject+=3
            Using memory insn operand 1: reject+=3
          alt=2,overall=12,losers=1,rld_nregs=0
         Choosing alt 2 in insn 356:  (0) w  (1) Utv {*aarch64_movoi}
      Creating newreg=319, assigning class NO_REGS to r319
  356: r318:OI=r319:OI
      REG_DEAD r317:V4SI
    Inserting insn reload before:
  357: r319:OI=r317:V4SI#0

            0 Non input pseudo reload: reject++
          alt=0,overall=13,losers=2,rld_nregs=4
            0 Non pseudo reload: reject++
          alt=1,overall=7,losers=1,rld_nregs=2
            0 Non input pseudo reload: reject++
            1 Spill pseudo into memory: reject+=3
            Using memory insn operand 1: reject+=3
            alt=2,overall=19,losers=2 -- refuse
         Choosing alt 1 in insn 357:  (0) Utv  (1) w {*aarch64_movoi}
      Creating newreg=320, assigning class FP_REGS to r320
  357: r319:OI=r320:OI
    Inserting insn reload before:
  358: r320:OI=r317:V4SI#0

and we keep oscillating between those two choices (alt 2 vs alt 1).
This wouldn't have happened if we'd allocated an FPR instead.

I think the problem here is that we're always trying to reload the
subreg rather than the inner register, even though the allocation for
the inner register isn't valid for the subreg.  There is code in
simplify_operand_subreg to detect this kind of situation, but it
seems to be missing a check for hard_regno_mode_ok.  The first patch
below seems to fix that.

> This patch fixes it by in the backend when we see such a paradoxical
> construction breaking it apart and issuing a clobber to correct the liveliness
> information and then emitting a normal subreg write for the component that the
> paradoxical subreg was trying to write to.
>
> Concretely we generate this:
>
> (insn 42 41 43 (clobber (reg/v:CI 122 [ diD.5226 ])) "small.i":121:23 -1
>      (nil))
>
> (insn 43 42 44 (set (subreg:V4SI (reg/v:CI 122 [ diD.5226 ]) 0)
>         (reg:V4SI 136)) "small.i":121:23 -1
>      (nil))
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master and back-port to GCC 9 and GCC 8 after some stew?
>
> I will look into seeing if we can not generate these at all, but I'm not sure
> this is possible since the mid-end would need both the Mode and the Class to
> know that a pseudo will be assigned to multiple hardregs.

Subreg-based register liveness divides the register into
REGMODE_NATURAL_SIZE chunks (traditionally word-sized chunks).
E.g. read_modify_subreg_p checks whether a particular subreg
preserves part of the old value or not.  I think it's these rules
that matter here.

The second patch below adds that check to emit_group_store and generates
slightly better code than the paradoxical version did.

Both patches fix the bug independently, although the second is more
of a workaround than a fix.  Unfortunately I messed up overnight testing
so they've only been tested on aarch64-linux-gnu so far.  I'll try again
tonight...

I think we should aim to fix this in target-independent code for master,
but I agree that's probably too invasive to backport and so the AArch64
workaround might be more appropriate for branches.

> +  /* If we have a paradoxical subreg trying to write to <MODE> from and the
> +     registers don't overlap then we need to break it apart.  What it's 
> trying
> +     to do is give two kind of information at the same time.  It's trying to
> +     convey liveness information by saying that the entire register will be
> +     written to eventually, but it also only wants to write a single part of 
> the
> +     register.  Hence the paradoxical subreg.
> +
> +     However reload doesn't understand this concept and it will ultimately 
> ICE.
> +     Instead of allowing this we will split the two concerns.  The liveness
> +     information will be conveyed using a clobber and then we break apart the
> +     paradoxical subreg into just a normal write of the part that it wanted 
> to
> +     write originally.  */

As above, we should avoid saying that reload doesn't understand
the concept.  We're just working around a specific bug.

> +  if (paradoxical_subreg_p (operands[1]))
> +    {
> +      if (!reg_overlap_mentioned_p (operands[0], operands[1]))
> +     emit_clobber (operands[0]);
> +      poly_uint64 offset = SUBREG_BYTE (operands[1]);
> +      operands[1] = SUBREG_REG (operands[1]);
> +      operands[0] = simplify_gen_subreg (GET_MODE (operands[1]), operands[0],
> +                                      <MODE>mode, offset);
> +    }

SUBREG_BYTE is always 0 for paradoxical subregs, so this will do the
wrong thing for big-endian.  It's probably better to use gen_lowpart
to get the destination.

We should probably also check that operands[0] is a REG.  In practice
it probably always will be at this point if the source is a SUBREG,
but it doesn't follow automatically, and given that this is supposed
to be a safe backport...

FWIW, another option would be to make aarch64_preferred_reload_class
return FP_REGS for 128-bit vector modes, if the given class is a
superset of FP_REGS.  On the one hand that feels cleaner: there's a
good argument that we should be doing that anyway, since it means
we can reload in Q registers rather than pairs of X registers.
But on the other hand, it feels less safe for backporting.  E.g. maybe
we'll find another corner case that fails if a GPR *isn't* chosen.

Thanks,
Richard

>From f75d209ab3c27fcfea915fe77f37fa4f973f9dd0 Mon Sep 17 00:00:00 2001
From: Richard Sandiford <[email protected]>
Date: Mon, 9 Mar 2020 19:42:57 +0000
Subject: [PATCH 1/2] lra: Tighten check for reloading paradoxical subregs
 [PR94052]

simplify_operand_subreg tries to detect whether the allocation for
a pseudo in a paradoxical subreg is also valid for the outer mode.
The condition it used to check for an invalid combination was:

  else if (REG_P (reg)
	   && REGNO (reg) >= FIRST_PSEUDO_REGISTER
	   && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
	   && (hard_regno_nregs (hard_regno, innermode)
	       < hard_regno_nregs (hard_regno, mode))
	   && (regclass = lra_get_allocno_class (REGNO (reg)))
	   && (type != OP_IN
	       || !in_hard_reg_set_p (reg_class_contents[regclass],
				      mode, hard_regno)
	       || overlaps_hard_reg_set_p (lra_no_alloc_regs,
					   mode, hard_regno)))

I think there are two problems with this:

(1) It never actually checks whether the hard register is valid for the
    outer mode (in the hard_regno_mode_ok sense).  If it isn't, any attempt
    to reload in the outer mode is likely to cycle, because the implied
    regno/mode combination will be just as invalid next time
    curr_insn_transform sees the subreg.

(2) The check is valid for little-endian only.  For big-endian we need
    to move hard_regno backwards.

Using simplify_subreg_regno should avoid both problems.

As the existing comment says, IRA should always take subreg references
into account when allocating hard registers, so this fix-up should only
really be needed for pseudos allocated by LRA itself.

2020-03-10  Richard Sandiford  <[email protected]>

gcc/
	PR rtl-optimization/94052
	* lra-constraints.c (simplify_operand_subreg): Reload the inner
	register of a paradoxical subreg if simplify_subreg_regno fails
	to give a valid hard register for the outer mode.
---
 gcc/lra-constraints.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index f71e0c9ff8d..bf6d4a2fd4b 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -1489,7 +1489,7 @@ static bool process_address (int, bool, rtx_insn **, rtx_insn **);
 static bool
 simplify_operand_subreg (int nop, machine_mode reg_mode)
 {
-  int hard_regno;
+  int hard_regno, inner_hard_regno;
   rtx_insn *before, *after;
   machine_mode mode, innermode;
   rtx reg, new_reg;
@@ -1735,15 +1735,19 @@ simplify_operand_subreg (int nop, machine_mode reg_mode)
      for the new uses.  */
   else if (REG_P (reg)
 	   && REGNO (reg) >= FIRST_PSEUDO_REGISTER
-	   && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
-	   && (hard_regno_nregs (hard_regno, innermode)
-	       < hard_regno_nregs (hard_regno, mode))
-	   && (regclass = lra_get_allocno_class (REGNO (reg)))
-	   && (type != OP_IN
-	       || !in_hard_reg_set_p (reg_class_contents[regclass],
-				      mode, hard_regno)
-	       || overlaps_hard_reg_set_p (lra_no_alloc_regs,
-					   mode, hard_regno)))
+	   && paradoxical_subreg_p (operand)
+	   && (inner_hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
+	   && ((hard_regno
+		= simplify_subreg_regno (inner_hard_regno, innermode,
+					 SUBREG_BYTE (operand), mode)) < 0
+	       || ((hard_regno_nregs (inner_hard_regno, innermode)
+		    < hard_regno_nregs (hard_regno, mode))
+		   && (regclass = lra_get_allocno_class (REGNO (reg)))
+		   && (type != OP_IN
+		       || !in_hard_reg_set_p (reg_class_contents[regclass],
+					      mode, hard_regno)
+		       || overlaps_hard_reg_set_p (lra_no_alloc_regs,
+						   mode, hard_regno)))))
     {
       /* The class will be defined later in curr_insn_transform.  */
       enum reg_class rclass
-- 
2.17.1

>From 19ae0b272c7a191634249cdab90e5d9512b1474b Mon Sep 17 00:00:00 2001
From: Richard Sandiford <[email protected]>
Date: Tue, 10 Mar 2020 09:17:17 +0000
Subject: [PATCH 2/2] expand: Tweak paradoxical subreg handling in
 emit_group_store [PR94052]

emit_group_store normally tries to start the sequence by storing
a paradoxical subreg of one source value into the whole destination.
That seems reasonable when the destination is effectively a single
register, but as explained in the comment, it seems likely to make
things worse then the destination is bigger than that.

2020-03-10  Richard Sandiford  <[email protected]>

gcc/
	PR middle-end/94052
	* expr.c (emit_group_store): Clobber a multi-register destination
	before storing to it.
---
 gcc/expr.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/gcc/expr.c b/gcc/expr.c
index b97c217e86d..57529f25769 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -2489,7 +2489,9 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED,
   for (i = start; i < finish; i++)
     {
       rtx reg = XEXP (XVECEXP (src, 0, i), 0);
-      if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER)
+      /* Checking for reg == dst (which is very unlikely to be true) ensures
+	 that we can emit a clobber of dst when dst is a pseudo register.  */
+      if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER || reg == dst)
 	{
 	  tmps[i] = gen_reg_rtx (GET_MODE (reg));
 	  emit_move_insn (tmps[i], reg);
@@ -2530,11 +2532,36 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED,
       if (!REG_P (dst) || REGNO (dst) < FIRST_PSEUDO_REGISTER)
 	dst = gen_reg_rtx (outer);
 
+      /* Clobber the destination if pieces of it can be modified
+	 independently.  This ensures that we can assign to the individual
+	 pieces of dst without leaving dst upwards exposed.
+
+	 The alternatives below are less likely to be useful for this case.
+	 Creating a paradoxical subreg leads to moves like:
+
+	     (set (reg:TI X) (subreg:TI (reg:DI Y) 0))
+
+	 On a 64-bit target, this would normally be split into two DI moves
+	 after register allocation.  We'd then expect one of those DI moves
+	 to be removed as dead due to later assignments to parts of X.
+	 However, the instruction must still be reloaded as a TImode move
+	 and the allocation of Y must be valid for TImode as well as DImode,
+	 even though we only really want a single DImode move.
+
+	 If we set dst to zero instead, there's a risk that the
+	 redundancy will never be eliminated, particularly if the
+	 destination occupies more than two registers.  */
+      if (known_gt (GET_MODE_SIZE (outer), REGMODE_NATURAL_SIZE (outer)))
+	{
+	  emit_clobber (dst);
+	  done = 1;
+	}
+
       /* Make life a bit easier for combine.  */
       /* If the first element of the vector is the low part
 	 of the destination mode, use a paradoxical subreg to
 	 initialize the destination.  */
-      if (start < finish)
+      if (!done && start < finish)
 	{
 	  inner = GET_MODE (tmps[start]);
 	  bytepos = subreg_lowpart_offset (inner, outer);
-- 
2.17.1

Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

Reply via email to