Tamar Christina <tamar.christ...@arm.com> writes: > Hi All, > > This works around an ICE in reload where from expand we get the following RTL > generated for VSTRUCT mode writes: > > (insn 446 354 445 2 (set (reg:CI 383) > (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci} > (nil)) > > This sequence is trying to say two things: > > 1) liveliness: It's trying to say that eventually the whole CI reg will be > written to. It does this by generating the paradoxical subreg. > 2) write data: It's trying to in the same instruction also write the V4SI mode > component at offset 0 in the CI reg. > > Reload is unable to understand this concept and so it attempts to handle this > instruction by breaking apart the instruction, first writing the data and then > tries to reload the paradoxical part. This gets it to the same instruction > again and eventually we ICE since we reach the limit of no. reloads.
reload/LRA does understand the concept. It just isn't handling this particular case very well :-) The pre-RA insn is: (insn 210 218 209 6 (set (reg/v:OI 182 [ vres ]) (subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi} (expr_list:REG_DEAD (reg:V4SI 289) (nil))) IRA allocates a hard register to r182 but not r289: 126:r182 l0 32 ... ... 129:r289 l0 mem ... This is because memory appears to be cheaper: a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 MEM:110,110 That's probably a bug in itself (possibly in the target costs), but it shouldn't be a correctness issue. LRA then handles insn 210 like this: Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317 Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318 210: r182:OI=r318:OI REG_DEAD r289:V4SI Inserting slow/invalid mem reload before: 355: r317:V4SI=[sfp:DI+0x60] 356: r318:OI=r317:V4SI#0 1 Non pseudo reload: reject++ alt=0,overall=1,losers=0,rld_nregs=0 Choosing alt 0 in insn 210: (0) =w (1) w {*aarch64_movoi} Change to class FP_REGS for r318 That looks OK so far, given the allocation. But later we have: ********** Assignment #2: ********** Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, tfreq=44)... Assign 32 to subreg reload r318 (freq=44) Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, tfreq=44)... Assign 0 to subreg reload r317 (freq=44) So LRA is assigning a GPR (x0) to the new V4SI register r317. It's this allocation that induces the cycling, because we get: Cycle danger: overall += LRA_MAX_REJECT alt=0,overall=606,losers=1,rld_nregs=2 0 Spill pseudo into memory: reject+=3 Using memory insn operand 0: reject+=3 0 Non input pseudo reload: reject++ Cycle danger: overall += LRA_MAX_REJECT alt=1,overall=619,losers=2,rld_nregs=2 1 Spill pseudo into memory: reject+=3 Using memory insn operand 1: reject+=3 alt=2,overall=12,losers=1,rld_nregs=0 Choosing alt 2 in insn 356: (0) w (1) Utv {*aarch64_movoi} Creating newreg=319, assigning class NO_REGS to r319 356: r318:OI=r319:OI REG_DEAD r317:V4SI Inserting insn reload before: 357: r319:OI=r317:V4SI#0 0 Non input pseudo reload: reject++ alt=0,overall=13,losers=2,rld_nregs=4 0 Non pseudo reload: reject++ alt=1,overall=7,losers=1,rld_nregs=2 0 Non input pseudo reload: reject++ 1 Spill pseudo into memory: reject+=3 Using memory insn operand 1: reject+=3 alt=2,overall=19,losers=2 -- refuse Choosing alt 1 in insn 357: (0) Utv (1) w {*aarch64_movoi} Creating newreg=320, assigning class FP_REGS to r320 357: r319:OI=r320:OI Inserting insn reload before: 358: r320:OI=r317:V4SI#0 and we keep oscillating between those two choices (alt 2 vs alt 1). This wouldn't have happened if we'd allocated an FPR instead. I think the problem here is that we're always trying to reload the subreg rather than the inner register, even though the allocation for the inner register isn't valid for the subreg. There is code in simplify_operand_subreg to detect this kind of situation, but it seems to be missing a check for hard_regno_mode_ok. The first patch below seems to fix that. > This patch fixes it by in the backend when we see such a paradoxical > construction breaking it apart and issuing a clobber to correct the liveliness > information and then emitting a normal subreg write for the component that the > paradoxical subreg was trying to write to. > > Concretely we generate this: > > (insn 42 41 43 (clobber (reg/v:CI 122 [ diD.5226 ])) "small.i":121:23 -1 > (nil)) > > (insn 43 42 44 (set (subreg:V4SI (reg/v:CI 122 [ diD.5226 ]) 0) > (reg:V4SI 136)) "small.i":121:23 -1 > (nil)) > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Ok for master and back-port to GCC 9 and GCC 8 after some stew? > > I will look into seeing if we can not generate these at all, but I'm not sure > this is possible since the mid-end would need both the Mode and the Class to > know that a pseudo will be assigned to multiple hardregs. Subreg-based register liveness divides the register into REGMODE_NATURAL_SIZE chunks (traditionally word-sized chunks). E.g. read_modify_subreg_p checks whether a particular subreg preserves part of the old value or not. I think it's these rules that matter here. The second patch below adds that check to emit_group_store and generates slightly better code than the paradoxical version did. Both patches fix the bug independently, although the second is more of a workaround than a fix. Unfortunately I messed up overnight testing so they've only been tested on aarch64-linux-gnu so far. I'll try again tonight... I think we should aim to fix this in target-independent code for master, but I agree that's probably too invasive to backport and so the AArch64 workaround might be more appropriate for branches. > + /* If we have a paradoxical subreg trying to write to <MODE> from and the > + registers don't overlap then we need to break it apart. What it's > trying > + to do is give two kind of information at the same time. It's trying to > + convey liveness information by saying that the entire register will be > + written to eventually, but it also only wants to write a single part of > the > + register. Hence the paradoxical subreg. > + > + However reload doesn't understand this concept and it will ultimately > ICE. > + Instead of allowing this we will split the two concerns. The liveness > + information will be conveyed using a clobber and then we break apart the > + paradoxical subreg into just a normal write of the part that it wanted > to > + write originally. */ As above, we should avoid saying that reload doesn't understand the concept. We're just working around a specific bug. > + if (paradoxical_subreg_p (operands[1])) > + { > + if (!reg_overlap_mentioned_p (operands[0], operands[1])) > + emit_clobber (operands[0]); > + poly_uint64 offset = SUBREG_BYTE (operands[1]); > + operands[1] = SUBREG_REG (operands[1]); > + operands[0] = simplify_gen_subreg (GET_MODE (operands[1]), operands[0], > + <MODE>mode, offset); > + } SUBREG_BYTE is always 0 for paradoxical subregs, so this will do the wrong thing for big-endian. It's probably better to use gen_lowpart to get the destination. We should probably also check that operands[0] is a REG. In practice it probably always will be at this point if the source is a SUBREG, but it doesn't follow automatically, and given that this is supposed to be a safe backport... FWIW, another option would be to make aarch64_preferred_reload_class return FP_REGS for 128-bit vector modes, if the given class is a superset of FP_REGS. On the one hand that feels cleaner: there's a good argument that we should be doing that anyway, since it means we can reload in Q registers rather than pairs of X registers. But on the other hand, it feels less safe for backporting. E.g. maybe we'll find another corner case that fails if a GPR *isn't* chosen. Thanks, Richard
>From f75d209ab3c27fcfea915fe77f37fa4f973f9dd0 Mon Sep 17 00:00:00 2001 From: Richard Sandiford <richard.sandif...@arm.com> Date: Mon, 9 Mar 2020 19:42:57 +0000 Subject: [PATCH 1/2] lra: Tighten check for reloading paradoxical subregs [PR94052] simplify_operand_subreg tries to detect whether the allocation for a pseudo in a paradoxical subreg is also valid for the outer mode. The condition it used to check for an invalid combination was: else if (REG_P (reg) && REGNO (reg) >= FIRST_PSEUDO_REGISTER && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 && (hard_regno_nregs (hard_regno, innermode) < hard_regno_nregs (hard_regno, mode)) && (regclass = lra_get_allocno_class (REGNO (reg))) && (type != OP_IN || !in_hard_reg_set_p (reg_class_contents[regclass], mode, hard_regno) || overlaps_hard_reg_set_p (lra_no_alloc_regs, mode, hard_regno))) I think there are two problems with this: (1) It never actually checks whether the hard register is valid for the outer mode (in the hard_regno_mode_ok sense). If it isn't, any attempt to reload in the outer mode is likely to cycle, because the implied regno/mode combination will be just as invalid next time curr_insn_transform sees the subreg. (2) The check is valid for little-endian only. For big-endian we need to move hard_regno backwards. Using simplify_subreg_regno should avoid both problems. As the existing comment says, IRA should always take subreg references into account when allocating hard registers, so this fix-up should only really be needed for pseudos allocated by LRA itself. 2020-03-10 Richard Sandiford <richard.sandif...@arm.com> gcc/ PR rtl-optimization/94052 * lra-constraints.c (simplify_operand_subreg): Reload the inner register of a paradoxical subreg if simplify_subreg_regno fails to give a valid hard register for the outer mode. --- gcc/lra-constraints.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c index f71e0c9ff8d..bf6d4a2fd4b 100644 --- a/gcc/lra-constraints.c +++ b/gcc/lra-constraints.c @@ -1489,7 +1489,7 @@ static bool process_address (int, bool, rtx_insn **, rtx_insn **); static bool simplify_operand_subreg (int nop, machine_mode reg_mode) { - int hard_regno; + int hard_regno, inner_hard_regno; rtx_insn *before, *after; machine_mode mode, innermode; rtx reg, new_reg; @@ -1735,15 +1735,19 @@ simplify_operand_subreg (int nop, machine_mode reg_mode) for the new uses. */ else if (REG_P (reg) && REGNO (reg) >= FIRST_PSEUDO_REGISTER - && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 - && (hard_regno_nregs (hard_regno, innermode) - < hard_regno_nregs (hard_regno, mode)) - && (regclass = lra_get_allocno_class (REGNO (reg))) - && (type != OP_IN - || !in_hard_reg_set_p (reg_class_contents[regclass], - mode, hard_regno) - || overlaps_hard_reg_set_p (lra_no_alloc_regs, - mode, hard_regno))) + && paradoxical_subreg_p (operand) + && (inner_hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 + && ((hard_regno + = simplify_subreg_regno (inner_hard_regno, innermode, + SUBREG_BYTE (operand), mode)) < 0 + || ((hard_regno_nregs (inner_hard_regno, innermode) + < hard_regno_nregs (hard_regno, mode)) + && (regclass = lra_get_allocno_class (REGNO (reg))) + && (type != OP_IN + || !in_hard_reg_set_p (reg_class_contents[regclass], + mode, hard_regno) + || overlaps_hard_reg_set_p (lra_no_alloc_regs, + mode, hard_regno))))) { /* The class will be defined later in curr_insn_transform. */ enum reg_class rclass -- 2.17.1
>From 19ae0b272c7a191634249cdab90e5d9512b1474b Mon Sep 17 00:00:00 2001 From: Richard Sandiford <richard.sandif...@arm.com> Date: Tue, 10 Mar 2020 09:17:17 +0000 Subject: [PATCH 2/2] expand: Tweak paradoxical subreg handling in emit_group_store [PR94052] emit_group_store normally tries to start the sequence by storing a paradoxical subreg of one source value into the whole destination. That seems reasonable when the destination is effectively a single register, but as explained in the comment, it seems likely to make things worse then the destination is bigger than that. 2020-03-10 Richard Sandiford <richard.sandif...@arm.com> gcc/ PR middle-end/94052 * expr.c (emit_group_store): Clobber a multi-register destination before storing to it. --- gcc/expr.c | 31 +++++++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-) diff --git a/gcc/expr.c b/gcc/expr.c index b97c217e86d..57529f25769 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -2489,7 +2489,9 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, for (i = start; i < finish; i++) { rtx reg = XEXP (XVECEXP (src, 0, i), 0); - if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER) + /* Checking for reg == dst (which is very unlikely to be true) ensures + that we can emit a clobber of dst when dst is a pseudo register. */ + if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER || reg == dst) { tmps[i] = gen_reg_rtx (GET_MODE (reg)); emit_move_insn (tmps[i], reg); @@ -2530,11 +2532,36 @@ emit_group_store (rtx orig_dst, rtx src, tree type ATTRIBUTE_UNUSED, if (!REG_P (dst) || REGNO (dst) < FIRST_PSEUDO_REGISTER) dst = gen_reg_rtx (outer); + /* Clobber the destination if pieces of it can be modified + independently. This ensures that we can assign to the individual + pieces of dst without leaving dst upwards exposed. + + The alternatives below are less likely to be useful for this case. + Creating a paradoxical subreg leads to moves like: + + (set (reg:TI X) (subreg:TI (reg:DI Y) 0)) + + On a 64-bit target, this would normally be split into two DI moves + after register allocation. We'd then expect one of those DI moves + to be removed as dead due to later assignments to parts of X. + However, the instruction must still be reloaded as a TImode move + and the allocation of Y must be valid for TImode as well as DImode, + even though we only really want a single DImode move. + + If we set dst to zero instead, there's a risk that the + redundancy will never be eliminated, particularly if the + destination occupies more than two registers. */ + if (known_gt (GET_MODE_SIZE (outer), REGMODE_NATURAL_SIZE (outer))) + { + emit_clobber (dst); + done = 1; + } + /* Make life a bit easier for combine. */ /* If the first element of the vector is the low part of the destination mode, use a paradoxical subreg to initialize the destination. */ - if (start < finish) + if (!done && start < finish) { inner = GET_MODE (tmps[start]); bytepos = subreg_lowpart_offset (inner, outer); -- 2.17.1