On Mon, 13 Jun 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Monday, June 13, 2022 9:38 AM
> > To: Richard Sandiford <richard.sandif...@arm.com>
> > Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org;
> > nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>;
> > Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> > <kyrylo.tkac...@arm.com>; ro...@eyesopen.com
> > Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> > instructions
> > 
> > On Mon, 13 Jun 2022, Richard Sandiford wrote:
> > 
> > > Richard Biener <rguent...@suse.de> writes:
> > > > On Wed, 8 Jun 2022, Richard Sandiford wrote:
> > > >> Tamar Christina <tamar.christ...@arm.com> writes:
> > > >> >> -----Original Message-----
> > > >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> > > >> >> Sent: Wednesday, June 8, 2022 11:31 AM
> > > >> >> To: Tamar Christina <tamar.christ...@arm.com>
> > > >> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> > > >> >> <richard.earns...@arm.com>; Marcus Shawcroft
> > > >> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> > > >> >> <kyrylo.tkac...@arm.com>
> > > >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure
> > > >> >> load instructions
> > > >> >>
> > > >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> > > >> >> > Hi All,
> > > >> >> >
> > > >> >> > At some point we started lowering the ld1r instructions in
> > > >> >> > gimple.
> > > >> >> >
> > > >> >> > That is:
> > > >> >> >
> > > >> >> > uint8x8_t f1(const uint8_t *in) {
> > > >> >> >     return vld1_dup_u8(&in[1]); }
> > > >> >> >
> > > >> >> > generates at gimple:
> > > >> >> >
> > > >> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
> > > >> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> > > >> >> >
> > > >> >> > Which is good, but we then generate:
> > > >> >> >
> > > >> >> > f1:
> > > >> >> >   ldr     b0, [x0, 1]
> > > >> >> >   dup     v0.8b, v0.b[0]
> > > >> >> >   ret
> > > >> >> >
> > > >> >> > instead of ld1r.
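> > > >> >> >
> > > >> >> > With ld1r we would expect something like the following (a
> > > >> >> > sketch: ld1r only takes a base-register address, so the offset
> > > >> >> > gets folded into the base first):
> > > >> >> >
> > > >> >> > f1:
> > > >> >> >   add     x0, x0, 1
> > > >> >> >   ld1r    {v0.8b}, [x0]
> > > >> >> >   ret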
> > > >> >> >
> > > >> >> > The reason for this is that the load instructions have too
> > > >> >> > restrictive a predicate, which stops combine from combining
> > > >> >> > the instructions, since the predicate only accepts simple
> > > >> >> > addressing modes.
> > > >> >> >
> > > >> >> > This patch relaxes the predicate to accept any memory operand
> > > >> >> > and relies on LRA to legitimize the address when it needs to,
> > > >> >> > as the constraint still only allows the simple addressing
> > > >> >> > modes.  Reload is always able to legitimize to these.
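> > > >> >> >
> > > >> >> > The shape of the change is roughly as follows (predicate names
> > > >> >> > taken from the current patterns; this is not the literal patch
> > > >> >> > text):
> > > >> >> >
> > > >> >> >   ;; Before: predicate and constraint both only accept the
> > > >> >> >   ;; simple addressing modes, so combine cannot form [x0, 1].
> > > >> >> >   (match_operand:BLK 1 "aarch64_simd_struct_operand" "Utv")
> > > >> >> >
> > > >> >> >   ;; After: any memory is fine for the predicate; the "Utv"
> > > >> >> >   ;; constraint is unchanged, so LRA reloads any address that
> > > >> >> >   ;; does not fit the constraint.
> > > >> >> >   (match_operand:BLK 1 "memory_operand" "Utv")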
> > > >> >> >
> > > >> >> > Secondly, since we are now actually generating more ld1r, it
> > > >> >> > became clear that the lane instructions suffer from a similar
> > > >> >> > issue.
> > > >> >> >
> > > >> >> > i.e.
> > > >> >> >
> > > >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
> > > >> >> >     float32x4_t dup = vld1q_dup_f32(&in[1]);
> > > >> >> >     return vfmaq_laneq_f32 (a, a, dup, 1); }
> > > >> >> >
> > > >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
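> > > >> >> >
> > > >> >> > That is, roughly (register choices assumed for illustration):
> > > >> >> >
> > > >> >> >   add     x0, x0, 4
> > > >> >> >   ld1r    {v1.4s}, [x0]
> > > >> >> >   fmla    v0.4s, v0.4s, v1.4s
> > > >> >> >
> > > >> >> > instead of the preferable
> > > >> >> >
> > > >> >> >   ldr     s1, [x0, 4]
> > > >> >> >   fmla    v0.4s, v0.4s, v1.s[0]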
> > > >> >> >
> > > >> >> > The reason for this is similar to the ld1r issue.  The
> > > >> >> > predicate is too restrictive in only accepting register
> > > >> >> > operands but not memory.
> > > >> >> >
> > > >> >> > This relaxes it to accept register and/or memory while leaving
> > > >> >> > the constraint to only accept registers.  This will have LRA
> > > >> >> > generate a reload if needed, forcing the memory operand into a
> > > >> >> > register using the standard patterns.
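> > > >> >> >
> > > >> >> > Again as a sketch only (the iterator and operand number here
> > > >> >> > are made up for the example):
> > > >> >> >
> > > >> >> >   ;; Before: memory is rejected outright by the predicate.
> > > >> >> >   (match_operand:VALL_F16 3 "register_operand" "w")
> > > >> >> >
> > > >> >> >   ;; After: the predicate also accepts memory, but the "w"
> > > >> >> >   ;; constraint stays register-only, so LRA forces the value
> > > >> >> >   ;; into a register via the standard move patterns.
> > > >> >> >   (match_operand:VALL_F16 3 "nonimmediate_operand" "w")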
> > > >> >> >
> > > >> >> > These two changes allow combine and reload to generate the
> > > >> >> > right sequences.
> > > >> >> >
> > > >> >> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > > >> >>
> > > >> >> This is going against the general direction of travel, which is
> > > >> >> to make the instruction's predicates and conditions enforce the
> > > >> >> constraints as much as possible (making optimistic assumptions
> > > >> >> about pseudo registers).
> > > >> >>
> > > >> >> The RA *can* deal with things like:
> > > >> >>
> > > >> >>   (match_operand:M N "general_operand" "r")
> > > >> >>
> > > >> >> but it's best avoided, for a few reasons:
> > > >> >>
> > > >> >> (1) The fix-up will be done in LRA, so IRA will not see the
> > > >> >>     temporary registers.  This can make the allocation of those
> > > >> >>     temporaries suboptimal but (more importantly) it might require
> > > >> >>     other previously-allocated registers to be spilled late due to
> > > >> >>     the unexpected increase in register pressure.
> > > >> >>
> > > >> >> (2) It ends up hiding instructions from the pre-RA optimisers.
> > > >> >>
> > > >> >> (3) It can also prevent combine opportunities (as well as create
> > > >> >>     them), unless the loose predicates in an insn I are propagated
> > > >> >>     to all patterns that might result from combining I with
> > > >> >>     something else.
> > > >> >>
> > > >> >> It sounds like the first problem (not generating ld1r) could be
> > > >> >> fixed by (a) combining aarch64_simd_dup<mode> and
> > > >> >> *aarch64_simd_ld1r<mode>, so that the register and memory
> > > >> >> alternatives are in the same pattern, and (b) using the merged
> > > >> >> instruction(s) to implement the vec_duplicate optab.
> > > >> >> Target-independent code should then make the address satisfy the
> > > >> >> predicate, simplifying the address where necessary.
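> > > >> >>
> > > >> >> Something like the following, say (only a sketch -- the
> > > >> >> alternatives and asm templates would need checking against the
> > > >> >> real patterns):
> > > >> >>
> > > >> >>   (define_insn "vec_duplicate<mode>"
> > > >> >>     [(set (match_operand:VDQ_I 0 "register_operand" "=w,w,w")
> > > >> >>           (vec_duplicate:VDQ_I
> > > >> >>             (match_operand:<VEL> 1 "nonimmediate_operand" "w,?r,Utv")))]
> > > >> >>     "TARGET_SIMD"
> > > >> >>     "@
> > > >> >>      dup\t%0.<Vtype>, %1.<Vetype>[0]
> > > >> >>      dup\t%0.<Vtype>, %<vw>1
> > > >> >>      ld1r\t{%0.<Vtype>}, %1"
> > > >> >>   )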
> > > >> >>
> > > >> >
> > > >> > I think I am likely missing something here. I would assume that
> > > >> > you wanted to use the optab to split the addressing off from the
> > > >> > mem expression so the combined insn matches.
> > > >> >
> > > >> > But in that case, why do you need to combine the two instructions?
> > > >> > I've tried, and it doesn't work, since the vec_duplicate optab
> > > >> > doesn't see the mem as op1, because in gimple the mem is not part
> > > >> > of the duplicate.
> > > >> >
> > > >> > So you still just see:
> > > >> >
> > > >> >>>> dbgrtx (ops[1].value)
> > > >> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
> > > >> >
> > > >> > as the argument to the dup is just an SSA_NAME.
> > > >>
> > > >> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would come
> > > >> from a constructor rather than a vec_duplicate_expr, so we don't
> > > >> get the usual benefit of folding single-use mems during expand.
> > > >>
> > > >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html
> > > >> moves towards using vec_duplicate even for fixed-length vectors.
> > > >> If we take that approach, then I suppose a plain constructor should
> > > >> be folded to a vec_duplicate where possible.
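> > > >>
> > > >> (For the example above, informally:
> > > >>
> > > >>   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> > > >>     -->
> > > >>   _4 = VEC_DUPLICATE_EXPR <_3>;
> > > >>
> > > >> with the exact dump syntax to be taken with a grain of salt.)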
> > > >>
> > > >> (Alternatively, we could use an extended vec_perm_expr with scalar
> > > >> inputs, as Richi suggested in that thread.)
> > > >>
> > > >> If we don't do that, or don't do it yet, then…
> > > >
> > > > I suppose since we already have vec_duplicate we can just use it ...
> > > > what was the reason to not do this originally?
> > >
> > > There just wasn't any specific benefit for fixed-length vectors at the
> > > time, and there were obvious potential problems -- introducing
> > > VEC_DUPLICATE_EXPRs too early would lose out on existing
> > > CONSTRUCTOR-based folds.
> > >
> > > Also, isel didn't exist at the time that vec_duplicate was added, but
> > > it seems like it might be a good place to do the replacement.
> > >
> > > Match rules that want to test for a uniform vector operand can already
> > > use vec_same_elem_p to handle all representations, but perhaps we also
> > > need a way of generating the “right” form of duplicate for the current
> > > stage in the pass pipeline?
> > 
> > I think we can have vec_duplicate without native target support by
> > expanding via CONSTRUCTOR, so vec_duplicate would be the correct form at
> > all stages and we would fix things up during RTL expansion directly.
> > 
> > As you noted most targets don't implement vec_duplicate yet.
> > 
> > > > I suppose the
> > > > vec_duplicate expander has a fallback via store_constructor?
> > > >
> > > > Originally I wanted to avoid multiple ways to express the same thing,
> > > > but vec_duplicate is a common enough special case and it also
> > > > usually maps to a special instruction in vector ISAs.
> > > > There's VIEW_CONVERT vs. vec_duplicate for V1m modes then; I suppose
> > > > VIEW_CONVERT is more canonical here.
> > >
> > > Is that already true for V1m constructors?  (view_convert being
> > > canonical and constructors not, I mean.)
> > 
> > I think so, yes.
> > 
> > > What do you think about the suggestion in the other thread of making
> > > VEC_PERM_EXPR take an arbitrary number of inputs, with (as you
> > > suggested) the inputs allowed to be scalars rather than vectors?
> > > VEC_PERM_EXPR could then replace both CONSTRUCTOR and
> > > VEC_DUPLICATE_EXPR, and “optimising” a normal constructor to a
> > > duplicate would just be a case of removing repeated scalar inputs.
> > 
> > It's indeed somewhat appealing to make VEC_PERM a Swiss army knife.
> > I'm not sure about making it a VL tree though: currently it's a nice GIMPLE
> > ternary, while VL would make it a single RHS with a GENERIC tree (unless we
> > introduce a special gimple_vec_perm gimple node).  That said, allowing
> > scalars as VEC_PERM inputs just to get rid of VEC_DUPLICATE will still leave
> > us with the VIEW_CONVERT special case.
> > 
> > At some point we might want to help targets with "interesting"
> > ISAs by lowering VEC_PERM to supported .VEC_PERM_CONSTs and relaxing
> > what permutes we allow earlier in the pipeline (I'm thinking of x86 with its
> > many special permutation ops and the open-coded vec-perm-const
> > expander).
> > 
> > So not sure what to do, but I'm happy to widen VEC_DUPLICATE_EXPR use.
> 
> Just to check: does this mean detecting VEC_DUPLICATE_EXPR opportunities
> during isel and converting the CONSTRUCTOR to it?

Hmm, that's a possibility, but I had thought of using VEC_DUPLICATE_EXPR
from the start (build_vector_from_val, some CTOR-to-duplicate folding,
etc.).  Clearly, detecting VEC_DUPLICATE_EXPR at ISEL time would be the
least intrusive option in case we want to resort to further IL
streamlining during this stage1.  So yes, this should work to fix your
optimization problem.
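
For concreteness, the isel detection might look something like this
fragment inside the statement walk (the helpers are real GCC internals,
but the placement and exact checks are only my assumption, untested):

  /* Hypothetical addition to gimple-isel.cc: rewrite a uniform
     CONSTRUCTOR assignment into a VEC_DUPLICATE_EXPR when the target
     implements the vec_duplicate optab for the vector mode.  */
  tree rhs = gimple_assign_rhs1 (stmt);
  tree elt;
  if (gimple_assign_rhs_code (stmt) == CONSTRUCTOR
      && (elt = uniform_vector_p (rhs))
      && optab_handler (vec_duplicate_optab,
                        TYPE_MODE (TREE_TYPE (rhs))) != CODE_FOR_nothing)
    {
      gimple_assign_set_rhs_with_ops (&gsi, VEC_DUPLICATE_EXPR, elt);
      update_stmt (gsi_stmt (gsi));
    }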

Richard.
