> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> <tamar.christ...@arm.com> wrote:
> >> >
> >> > Hi All,
> >> >
> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that
> >> > states that for predicated operations that also produce a predicate,
> >> > it is preferred that codegen use a different register for the
> >> > destination than for the input predicate, in order to avoid a
> >> > performance overhead.
> >> >
> >> > This of course has the problem that it increases register pressure
> >> > and so should be done with care.  Additionally, not all
> >> > micro-architectures have this consideration, so it shouldn't be
> >> > done by default.
> >> >
> >> > The patch series adds support for doing conditional early clobbers
> >> > through a combination of new alternatives and attributes to control
> >> > their availability.
> >>
> >> You could have two alternatives, one with early clobber and one with
> >> a matching constraint where you'd disparage the matching constraint one?
> >>
> >
> > Yeah, that's what I do, though there's no need to disparage the
> > non-early-clobber alternative as the early clobber alternative will
> > naturally get a penalty if it needs a reload.
>
> But I think Richard's suggestion was to disparage the one with a matching
> constraint (not the earlyclobber), to reflect the increased cost of
> reusing the register.
>
> We did take that approach for gathers, e.g.:
>
>      [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>      [?w, Z, 0, Ui1, Ui1, Upl] ^
>
> The (supposed) advantage is that, if register pressure is so tight
> that using matching registers is the only alternative, we still
> have the opportunity to do that, as a last resort.
>
> Providing only an earlyclobber version means that using the same
> register is prohibited outright.  If no other register is free, the RA
> would need to spill something else to free up a temporary register.
> And it might then do the equivalent of (pseudo-code):
>
>      not p1.b, ..., p0.b
>      mov p0.d, p1.d
>
> after spilling what would otherwise have occupied p1.  In that
> situation it would be better to use:
>
>      not p0.b, ..., p0.b
>
> and not introduce the spill of p1.
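For reference, since pred-clobber.c itself isn't quoted anywhere in this
thread, it would be along the lines of the following -- a hypothetical
reconstruction consistent with the generated code below, not the exact
file (the names `foo` and `use` match the asm; everything else is
assumed):

  #include <stdint.h>
  #include <arm_sve.h>

  extern void use (svbool_t);

  /* Predicated compare whose governing predicate (from the ptrue) and
     predicate result both live in P registers -- the case the SWoG
     same-register clause is about.  An unsigned < compare lowers to
     CMPLO, and the scalar operand is dup'd into a Z register.  */
  void
  foo (uint16_t x, svuint16_t v)
  {
    use (svcmplt (svptrue_b8 (), v, x));
  }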
I think I understood what Richi meant, but I thought it was already
working that way, i.e. as one of the testcases I had:

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]

foo:
        mov     z31.h, w0
        ptrue   p0.b, all
        cmplo   p0.h, p0/z, z0.h, z31.h
        b       use

and reload did not force a spill.

My understanding of how this works, and how it seems to be working, is
that since reload costs the alternatives from front to back, the
cheapest one wins and it stops evaluating the rest.

The early clobber alternative is first and preferred; however, when
it's not possible, i.e. it requires a non-pseudo reload, the reload
cost is added to the alternative.

However you're right that in the following testcase:

  -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4
  -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10
  -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -fdump-rtl-reload

i.e. giving it one extra free register, we inexplicably get a spill:

foo:
        addvl   sp, sp, #-1
        mov     z31.h, w0
        ptrue   p0.b, all
        str     p15, [sp]
        cmplo   p15.h, p0/z, z0.h, z31.h
        mov     p0.b, p15.b
        ldr     p15, [sp]
        addvl   sp, sp, #1
        b       use

so that's unexpected and very weird, as p15 has no defined value..

Now adding the ? as suggested to the non-early-clobber alternative does
not fix it, and my mental model for how this is supposed to work does
not quite line up..  Why would making the non-clobber alternative even
more expensive help during high register pressure?

But with that suggestion the above case does not get fixed, and the
following case:

  -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4
  -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10
  -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15
  -fdump-rtl-reload

ICEs:

pred-clobber.c: In function 'foo':
pred-clobber.c:9:1: error: unable to find a register to spill
    9 | }
      | ^
pred-clobber.c:9:1: error: this is the insn:
(insn 10 22 19 2 (parallel [
            (set (reg:VNx8BI 110 [104])
                (unspec:VNx8BI [
                        (reg:VNx8BI 112)
                        (const_int 1 [0x1])
                        (ltu:VNx8BI (reg:VNx8HI 32 v0)
                            (reg:VNx8HI 63 v31))
                    ] UNSPEC_PRED_Z))
            (clobber (reg:CC_NZC 66 cc))
        ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
     (expr_list:REG_DEAD (reg:VNx8BI 112)
        (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
            (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
                (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
                    (nil))))))
during RTL pass: reload
dump file: pred-clobber.c.315r.reload

and this is because the use of ? has the unintended side-effect of
blocking a register class entirely during sched1, as we've recently
discovered; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

In this case it marked the alternative as NO_REGS during sched1 and so
it's completely dead.  The use of the ? alternatives has caused quite
a bit of code bloat, as we've recently discovered, because of this
unexpected and undocumented behavior.
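For concreteness, the ?-based variant described above would look
roughly like this on the compare pattern -- a sketch following the
gathers example, not the exact change I tested:

  {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
     [ ?Upa     , Upl , w , <sve_imm_con>; *                   ] ^
     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
     [ ?Upa     , Upl , w , w            ; *                   ] ^
  }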
To me,

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 93ec59e58af..2ee3d8ea35e 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
 	(clobber (reg:CC_NZC CC_REGNUM))]
   "TARGET_SVE"
   {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
-     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
-     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
-     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
-     [ Upa      , Upl , w , w            ; *                   ] ^
+     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
+     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
+     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
+     [ Upa      , Upl , w , w            ; *                   ] ^
   }
 )

would have been the right approach, i.e. we prefer the alternative
unless a reload is needed, which should work, no?  Well, it would if
^ weren't broken the same way as ?.  Perhaps I need to use Wilco's new
alternative that doesn't block a register class?

But I'm probably missing something...

>
> Another case where using matching registers is natural is for
> loop-carried dependencies.  Do we want to keep them in:
>
> loop:
>    ...no other sets of p0....
>    not p0.b, ..., p0.b
>    ...no other sets of p0....
>    bne loop
>
> or should we split it to:
>
> loop:
>    ...no other sets of p0....
>    not p1.b, ..., p0.b
>    mov p0.d, p1.d
>    ...no other sets of p0....
>    bne loop
>
> ?

On the uarches that this affects the two are equivalent (I'm happy to
expand on this internally if you'd like), so in those cases the first
one is preferred as it won't matter.

Thanks for the review and explanation!

Tamar

>
> Thanks,
> Richard
>
> >
> > Cheers,
> > Tamar
> >
> >> > On high register pressure we also use LRA's costing to prefer not
> >> > to use the alternative and instead just use the tie, as this is
> >> > preferable to a reload.
> >> >
> >> > Concretely this patch series does:
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p3.b, all
> >> >         cmplo   p0.h, p3/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > Testcases for the changes are in the last patch of the series.
> >> >
> >> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > ---
> >> >
> >> > --