Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

2020-03-13 Thread Richard Sandiford
Tamar Christina  writes:
> Hi Richard,
>
> I have updated the patch, changelog is the same.
>
> bootstrapped and regtested on aarch64-none-linux-gnu and no issues.
>
> OK for gcc 9 and 8?
>
> Thanks,
> Tamar
>
> [...]
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 89aaf8c018e3340dd2d53fc2a6538d3d1220b103..16abe70be0ad90d3befdc5c4dfdb1471f2c98c58
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -5330,6 +5330,26 @@ (define_expand "mov"
>if (GET_CODE (operands[0]) != REG)
>   operands[1] = force_reg (mode, operands[1]);
>  }
> +
> +  /* If we have a paradoxical subreg trying to write to  from and the
> + registers don't overlap then we need to break it apart.  What it's 
> trying
> + to do is give two kind of information at the same time.  It's trying to
> + convey liveness information by saying that the entire register will be
> + written to eventually, but it also only wants to write a single part of 
> the
> + register.  Hence the paradoxical subreg.
> +
> + Instead of allowing this we will split the two concerns.  The liveness
> + information will be conveyed using a clobber and then we break apart the
> + paradoxical subreg into just a normal write of the part that it wanted 
> to
> + write originally.  */
> +
> +  if (REG_P(operands[0]) && paradoxical_subreg_p (operands[1]))

Sorry for the nit, but: space between REG_P and (

OK for backports with that change, once the bug is fixed in master.

Thanks,
Richard

> +{
> +  if (!reg_overlap_mentioned_p (operands[0], operands[1]))
> + emit_clobber (operands[0]);
> +  operands[1] = SUBREG_REG (operands[1]);
> +  operands[0] = gen_lowpart (GET_MODE (operands[1]), operands[0]);
> +}
>  })
 
 


Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

2020-03-12 Thread Tamar Christina
Hi Richard,

I have updated the patch, changelog is the same.

bootstrapped and regtested on aarch64-none-linux-gnu and no issues.

OK for gcc 9 and 8?

Thanks,
Tamar

The 03/10/2020 11:49, Richard Sandiford wrote:
> Tamar Christina  writes:
> > Hi All,
> >
> > This works around an ICE in reload where from expand we get the following 
> > RTL
> > generated for VSTRUCT mode writes:
> >
> > (insn 446 354 445 2 (set (reg:CI 383)
> >  (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci}
> >  (nil))
> >
> > This sequence is trying to say two things:
> >
> > 1) liveliness: It's trying to say that eventually the whole CI reg will be
> >written to. It does this by generating the paradoxical subreg.
> > 2) write data: It's trying to in the same instruction also write the V4SI 
> > mode
> >component at offset 0 in the CI reg.
> >
> > Reload is unable to understand this concept and so it attempts to handle 
> > this
> > instruction by breaking apart the instruction, first writing the data and 
> > then
> > tries to reload the paradoxical part.  This gets it to the same instruction
> > again and eventually we ICE since we reach the limit of no. reloads.
> 
> reload/LRA does understand the concept.  It just isn't handling this
> particular case very well :-)
> 
> The pre-RA insn is:
> 
> (insn 210 218 209 6 (set (reg/v:OI 182 [ vres ])
> (subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi}
>  (expr_list:REG_DEAD (reg:V4SI 289)
> (nil)))
> 
> IRA allocates a hard register to r182 but not r289:
> 
>   126:r182 l032  ...
>   ...
>   129:r289 l0   mem  ...
> 
> This is because memory appears to be cheaper:
> 
>   a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 
> MEM:110,110
> 
> That's probably a bug in itself (possibly in the target costs), but it
> shouldn't be a correctness issue.
> 
> LRA then handles insn 210 like this:
> 
>   Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317
>   Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318
>   210: r182:OI=r318:OI
>   REG_DEAD r289:V4SI
> Inserting slow/invalid mem reload before:
>   355: r317:V4SI=[sfp:DI+0x60]
>   356: r318:OI=r317:V4SI#0
> 
> 1 Non pseudo reload: reject++
>   alt=0,overall=1,losers=0,rld_nregs=0
>  Choosing alt 0 in insn 210:  (0) =w  (1) w {*aarch64_movoi}
>   Change to class FP_REGS for r318
> 
> That looks OK so far, given the allocation.  But later we have:
> 
> ** Assignment #2: **
> 
>  Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, 
> tfreq=44)...
>Assign 32 to subreg reload r318 (freq=44)
>  Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, 
> tfreq=44)...
>Assign 0 to subreg reload r317 (freq=44)
> 
> So LRA is assigning a GPR (x0) to the new V4SI register r317.  It's this
> allocation that induces the cycling, because we get:
> 
>Cycle danger: overall += LRA_MAX_REJECT
>   alt=0,overall=606,losers=1,rld_nregs=2
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> Cycle danger: overall += LRA_MAX_REJECT
>   alt=1,overall=619,losers=2,rld_nregs=2
> 1 Spill pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
>   alt=2,overall=12,losers=1,rld_nregs=0
>  Choosing alt 2 in insn 356:  (0) w  (1) Utv {*aarch64_movoi}
>   Creating newreg=319, assigning class NO_REGS to r319
>   356: r318:OI=r319:OI
>   REG_DEAD r317:V4SI
> Inserting insn reload before:
>   357: r319:OI=r317:V4SI#0
> 
> 0 Non input pseudo reload: reject++
>   alt=0,overall=13,losers=2,rld_nregs=4
> 0 Non pseudo reload: reject++
>   alt=1,overall=7,losers=1,rld_nregs=2
> 0 Non input pseudo reload: reject++
> 1 Spill pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
> alt=2,overall=19,losers=2 -- refuse
>  Choosing alt 1 in insn 357:  (0) Utv  (1) w {*aarch64_movoi}
>   Creating newreg=320, assigning class FP_REGS to r320
>   357: r319:OI=r320:OI
> Inserting insn reload before:
>   358: r320:OI=r317:V4SI#0
> 
> and we keep oscillating between those two choices (alt 2 vs alt 1).
> This wouldn't have happened if we'd allocated an FPR instead.
> 
> I think the problem here is that we're always trying to reload the
> subreg rather than the inner register, even though the allocation for
> the inner register isn't valid for the subreg.  There is code in
> simplify_operand_subreg to detect this kind of situation, but it
> seems to be missing a check for hard_regno_mode_ok.  The first patch
> below seems to fix that.
> 
> > This patch fixes it by in the backend when we see such a 

Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

2020-03-10 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This works around an ICE in reload where from expand we get the following RTL
> generated for VSTRUCT mode writes:
>
> (insn 446 354 445 2 (set (reg:CI 383)
>  (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci}
>  (nil))
>
> This sequence is trying to say two things:
>
> 1) liveliness: It's trying to say that eventually the whole CI reg will be
>  written to. It does this by generating the paradoxical subreg.
> 2) write data: It's trying to in the same instruction also write the V4SI mode
>  component at offset 0 in the CI reg.
>
> Reload is unable to understand this concept and so it attempts to handle this
> instruction by breaking apart the instruction, first writing the data and then
> tries to reload the paradoxical part.  This gets it to the same instruction
> again and eventually we ICE since we reach the limit of no. reloads.

reload/LRA does understand the concept.  It just isn't handling this
particular case very well :-)

The pre-RA insn is:

(insn 210 218 209 6 (set (reg/v:OI 182 [ vres ])
(subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi}
 (expr_list:REG_DEAD (reg:V4SI 289)
(nil)))

IRA allocates a hard register to r182 but not r289:

  126:r182 l032  ...
  ...
  129:r289 l0   mem  ...

This is because memory appears to be cheaper:

  a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 
MEM:110,110

That's probably a bug in itself (possibly in the target costs), but it
shouldn't be a correctness issue.

LRA then handles insn 210 like this:

  Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317
  Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318
  210: r182:OI=r318:OI
  REG_DEAD r289:V4SI
Inserting slow/invalid mem reload before:
  355: r317:V4SI=[sfp:DI+0x60]
  356: r318:OI=r317:V4SI#0

1 Non pseudo reload: reject++
  alt=0,overall=1,losers=0,rld_nregs=0
 Choosing alt 0 in insn 210:  (0) =w  (1) w {*aarch64_movoi}
  Change to class FP_REGS for r318

That looks OK so far, given the allocation.  But later we have:

** Assignment #2: **

 Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, 
tfreq=44)...
   Assign 32 to subreg reload r318 (freq=44)
 Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, 
tfreq=44)...
   Assign 0 to subreg reload r317 (freq=44)

So LRA is assigning a GPR (x0) to the new V4SI register r317.  It's this
allocation that induces the cycling, because we get:

   Cycle danger: overall += LRA_MAX_REJECT
  alt=0,overall=606,losers=1,rld_nregs=2
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
Cycle danger: overall += LRA_MAX_REJECT
  alt=1,overall=619,losers=2,rld_nregs=2
1 Spill pseudo into memory: reject+=3
Using memory insn operand 1: reject+=3
  alt=2,overall=12,losers=1,rld_nregs=0
 Choosing alt 2 in insn 356:  (0) w  (1) Utv {*aarch64_movoi}
  Creating newreg=319, assigning class NO_REGS to r319
  356: r318:OI=r319:OI
  REG_DEAD r317:V4SI
Inserting insn reload before:
  357: r319:OI=r317:V4SI#0

0 Non input pseudo reload: reject++
  alt=0,overall=13,losers=2,rld_nregs=4
0 Non pseudo reload: reject++
  alt=1,overall=7,losers=1,rld_nregs=2
0 Non input pseudo reload: reject++
1 Spill pseudo into memory: reject+=3
Using memory insn operand 1: reject+=3
alt=2,overall=19,losers=2 -- refuse
 Choosing alt 1 in insn 357:  (0) Utv  (1) w {*aarch64_movoi}
  Creating newreg=320, assigning class FP_REGS to r320
  357: r319:OI=r320:OI
Inserting insn reload before:
  358: r320:OI=r317:V4SI#0

and we keep oscillating between those two choices (alt 2 vs alt 1).
This wouldn't have happened if we'd allocated an FPR instead.

I think the problem here is that we're always trying to reload the
subreg rather than the inner register, even though the allocation for
the inner register isn't valid for the subreg.  There is code in
simplify_operand_subreg to detect this kind of situation, but it
seems to be missing a check for hard_regno_mode_ok.  The first patch
below seems to fix that.

> This patch fixes it by in the backend when we see such a paradoxical
> construction breaking it apart and issuing a clobber to correct the liveliness
> information and then emitting a normal subreg write for the component that the
> paradoxical subreg was trying to write to.
>
> Concretely we generate this:
>
> (insn 42 41 43 (clobber (reg/v:CI 122 [ diD.5226 ])) "small.i":121:23 -1
>  (nil))
>
> (insn 43 42 44 (set (subreg:V4SI (reg/v:CI 122 [ diD.5226 ]) 0)
> (reg:V4SI 136)) "small.i":121:23 -1
>  (nil))
>
> 

RE: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

2020-03-09 Thread Tamar Christina
My Attachment seems to have been stripped. So sending it again inline and 
hoping the tabs don't get stripped.

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
89aaf8c018e3340dd2d53fc2a6538d3d1220b103..47428dd3e9c4fa8fc1f3d876e774defb6bc64640
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5330,6 +5330,29 @@ (define_expand "mov"
   if (GET_CODE (operands[0]) != REG)
operands[1] = force_reg (mode, operands[1]);
 }
+
+  /* If we have a paradoxical subreg trying to write to  from and the
+ registers don't overlap then we need to break it apart.  What it's trying
+ to do is give two kind of information at the same time.  It's trying to
+ convey liveness information by saying that the entire register will be
+ written to eventually, but it also only wants to write a single part of 
the
+ register.  Hence the paradoxical subreg.
+
+ However reload doesn't understand this concept and it will ultimately ICE.
+ Instead of allowing this we will split the two concerns.  The liveness
+ information will be conveyed using a clobber and then we break apart the
+ paradoxical subreg into just a normal write of the part that it wanted to
+ write originally.  */
+
+  if (paradoxical_subreg_p (operands[1]))
+{
+  if (!reg_overlap_mentioned_p (operands[0], operands[1]))
+   emit_clobber (operands[0]);
+  poly_uint64 offset = SUBREG_BYTE (operands[1]);
+  operands[1] = SUBREG_REG (operands[1]);
+  operands[0] = simplify_gen_subreg (GET_MODE (operands[1]), operands[0],
+mode, offset);
+}
 })
 
 
diff --git a/gcc/testsuite/g++.target/aarch64/pr94052.C 
b/gcc/testsuite/g++.target/aarch64/pr94052.C
new file mode 100644
index 
..d36c9bdc1588533db35eb3cbd2502034edd25452
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/pr94052.C
@@ -0,0 +1,174 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -std=gnu++11 -w" } */
+
+namespace c {
+typedef int d;
+template  struct f { typedef e g; };
+template  struct h;
+template  e aa(typename f::g i) { return i; }
+template  struct j {};
+template  struct k;
+template  struct k<1, j> { typedef m g; };
+template  typename k>::g ab(j);
+} // namespace c
+typedef long d;
+typedef char o;
+typedef int p;
+typedef char q;
+typedef int r;
+namespace {
+struct s;
+constexpr d t = 6;
+template  class ad {
+public:
+  static constexpr d u = t;
+  d v();
+  d x();
+  d y();
+};
+class z : ad {};
+struct ae {
+  p af;
+};
+class ag {
+public:
+  ae ah();
+};
+} // namespace
+typedef __Int32x4_t ai;
+typedef struct {
+  ai aj[2];
+} ak;
+typedef int al;
+void am(p *a, ai b) { __builtin_aarch64_st1v4si(a, b); }
+namespace an {
+class ao {
+public:
+  bool operator==(ao);
+  d v();
+  d x();
+};
+class ap : public ad {};
+class aq {
+public:
+  c::j ar();
+  int as();
+  int at();
+};
+class au {
+public:
+  virtual d av(d);
+  virtual ap aw();
+  virtual ag ax();
+};
+class ay {};
+class az {
+  virtual void ba(const ay &, const s &);
+};
+using bb = az;
+class bc;
+class bd : bb {
+  void ba(const ay &, const s &);
+  bc *be;
+  bc *bf;
+  bc *bg;
+  aq bh;
+  int bi;
+  int bj;
+  ao bk;
+};
+namespace bl {
+namespace bm {
+namespace bn {
+class bo;
+}
+} // namespace bm
+} // namespace bl
+namespace bn {
+template >
+ai bp(ac *, ac *, ac *, al, al, al, d, p);
+template >
+ak bq(ac *br, ac *bs, ac *bt, al bu, al bv, al bw, d bx, int, int by) {
+  ak{bp(br, bs, bt, bu, bv, bw, bx, by), bp(br, bs, bt, bu, bv, bw, bx, by)};
+}
+template >
+ak bz(ac *, ac *, ac *, al, al, al &, int, p);
+template  void ca(p *, const ak &);
+template <> void ca<1>(p *buffer, const ak ) {
+  am(buffer, cb.aj[0]);
+  am(buffer + 4, cb.aj[1]);
+}
+int cc(int, int);
+} // namespace bn
+class bc {
+public:
+  virtual au *cd();
+};
+class ce {
+public:
+  q *cf();
+};
+template  struct cg {
+  template  static void ci(ay, z cj, ch ck) { ck(cj); }
+};
+template  void cl(ay w, ch ck) {
+  z cj;
+  cg::ci(w, cj, c::aa(ck));
+}
+namespace {
+template  class co {
+public:
+  static void convolve(ay, int cs, bc *cp, bc *cq, bc *cr, aq cw, int, ao ct) {
+int by = cp->cd()->ax().ah().af;
+int cu = cq->cd()->ax().ah().af;
+cp->cd()->aw().v();
+int cv = cp->cd()->aw().x();
+cp->cd()->aw().y();
+cp->cd()->aw();
+int da = cr->cd()->aw().x();
+int cx = cq->cd()->aw().x();
+cq->cd()->aw().y();
+int cy = cr->cd()->av(0);
+int cz = cr->cd()->av(1);
+bn::cc(cs, cn);
+int de = c::ab<1>(cw.ar());
+cw.as();
+cw.at();
+ay db;
+ce dc;
+ce dd;
+ce w;
+q *di = w.cf();
+cl(db, [&](z) {
+  int df;
+  dc;
+  di;
+  cx;
+  auto dg(cu);
+  auto dh(cu);
+  auto dl(cu);
+  for (; cz; df += de) {
+auto br = reinterpret_cast(cv);
+auto bs = reinterpret_cast(cv);
+