> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> <tamar.christ...@arm.com> wrote:
> >> >
> >> > Hi All,
> >> >
> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that
> >> > states that for predicated operations that also produce a predicate,
> >> > it is preferred that codegen use a different register for the
> >> > destination than for the input predicate, in order to avoid a
> >> > performance overhead.
> >> >
> >> > This of course has the problem that it increases register pressure
> >> > and so should be done with care.  Additionally, not all
> >> > micro-architectures have this consideration, so it shouldn't be
> >> > done by default.
> >> >
> >> > The patch series adds support for doing conditional early clobbers
> >> > through a combination of new alternatives and attributes to control
> >> > their availability.
> >>
> >> You could have two alternatives, one with early clobber and one with
> >> a matching constraint where you'd disparage the matching constraint one?
> >>
> >
> > Yeah, that's what I do, though there's no need to disparage the
> > non-early-clobber alternative as the early clobber alternative will
> > naturally get a penalty if it needs a reload.
>
> But I think Richard's suggestion was to disparage the one with a matching
> constraint (not the earlyclobber), to reflect the increased cost of
> reusing the register.
>
> We did take that approach for gathers, e.g.:
>
>      [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>      [?w, Z, 0, Ui1, Ui1, Upl] ^
>
> The (supposed) advantage is that, if register pressure is so tight
> that using matching registers is the only alternative, we still
> have the opportunity to do that, as a last resort.
>
> Providing only an earlyclobber version means that using the same
> register is prohibited outright.  If no other register is free, the RA
> would need to spill something else to free up a temporary register.
> And it might then do the equivalent of (pseudo-code):
>
>      not p1.b, ..., p0.b
>      mov p0.d, p1.d
>
> after spilling what would otherwise have occupied p1.  In that
> situation it would be better to use:
>
>      not p0.b, ..., p0.b
>
> and not introduce the spill of p1.
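For reference, since pred-clobber.c itself isn't quoted anywhere in this
thread, it would be along the lines of the following -- a hypothetical
reconstruction consistent with the generated code below, not the exact
file (the names `foo` and `use` match the asm; everything else is
assumed):

  #include <stdint.h>
  #include <arm_sve.h>

  extern void use (svbool_t);

  /* Predicated compare whose governing predicate (from the ptrue) and
     predicate result both live in P registers -- the case the SWoG
     same-register clause is about.  An unsigned < compare lowers to
     CMPLO, and the scalar operand is dup'd into a Z register.  */
  void
  foo (uint16_t x, svuint16_t v)
  {
    use (svcmplt (svptrue_b8 (), v, x));
  }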
I think I understood what Richi meant, but I thought it was already
working that way, i.e. as one of the testcases I had:

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]

foo:
        mov     z31.h, w0
        ptrue   p0.b, all
        cmplo   p0.h, p0/z, z0.h, z31.h
        b       use

and reload did not force a spill.

My understanding of how this works, and how it seems to be working, is
that since reload costs the alternatives from front to back, the
cheapest one wins and it stops evaluating the rest.

The early clobber alternative is first and preferred; however, when
it's not possible, i.e. it requires a non-pseudo reload, the reload
cost is added to the alternative.

However you're right that in the following testcase:

  -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4
  -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10
  -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -fdump-rtl-reload

i.e. giving it one extra free register, we inexplicably get a spill:

foo:
        addvl   sp, sp, #-1
        mov     z31.h, w0
        ptrue   p0.b, all
        str     p15, [sp]
        cmplo   p15.h, p0/z, z0.h, z31.h
        mov     p0.b, p15.b
        ldr     p15, [sp]
        addvl   sp, sp, #1
        b       use

so that's unexpected and very weird, as p15 has no defined value..

Now adding the ? as suggested to the non-early-clobber alternative does
not fix it, and my mental model for how this is supposed to work does
not quite line up..  Why would making the non-clobber alternative even
more expensive help during high register pressure?

But with that suggestion the above case does not get fixed, and the
following case:

  -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4
  -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10
  -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15
  -fdump-rtl-reload

ICEs:

pred-clobber.c: In function 'foo':
pred-clobber.c:9:1: error: unable to find a register to spill
    9 | }
      | ^
pred-clobber.c:9:1: error: this is the insn:
(insn 10 22 19 2 (parallel [
            (set (reg:VNx8BI 110 [104])
                (unspec:VNx8BI [
                        (reg:VNx8BI 112)
                        (const_int 1 [0x1])
                        (ltu:VNx8BI (reg:VNx8HI 32 v0)
                            (reg:VNx8HI 63 v31))
                    ] UNSPEC_PRED_Z))
            (clobber (reg:CC_NZC 66 cc))
        ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
     (expr_list:REG_DEAD (reg:VNx8BI 112)
        (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
            (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
                (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
                    (nil))))))
during RTL pass: reload
dump file: pred-clobber.c.315r.reload

and this is because the use of ? has the unintended side-effect of
blocking a register class entirely during sched1, as we've recently
discovered; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

In this case it marked the alternative as NO_REGS during sched1 and so
it's completely dead.  The use of the ? alternatives has caused quite
a bit of code bloat, as we've recently discovered, because of this
unexpected and undocumented behavior.
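For concreteness, the ?-based variant described above would look
roughly like this on the compare pattern -- a sketch following the
gathers example, not the exact change I tested:

  {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
     [ ?Upa     , Upl , w , <sve_imm_con>; *                   ] ^
     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
     [ ?Upa     , Upl , w , w            ; *                   ] ^
  }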
To me,

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 93ec59e58af..2ee3d8ea35e 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
 	(clobber (reg:CC_NZC CC_REGNUM))]
   "TARGET_SVE"
   {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
-     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
-     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
-     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
-     [ Upa      , Upl , w , w            ; *                   ] ^
+     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
+     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
+     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
+     [ Upa      , Upl , w , w            ; *                   ] ^
   }
 )

would have been the right approach, i.e. we prefer the alternative
unless a reload is needed, which should work, no?  Well, it would if
^ weren't broken the same way as ?.  Perhaps I need to use Wilco's new
alternative that doesn't block a register class?

But I'm probably missing something...

>
> Another case where using matching registers is natural is for
> loop-carried dependencies.  Do we want to keep them in:
>
> loop:
>    ...no other sets of p0....
>    not p0.b, ..., p0.b
>    ...no other sets of p0....
>    bne loop
>
> or should we split it to:
>
> loop:
>    ...no other sets of p0....
>    not p1.b, ..., p0.b
>    mov p0.d, p1.d
>    ...no other sets of p0....
>    bne loop
>
> ?

On the uarches that this affects the two are equivalent (I'm happy to
expand on this internally if you'd like), so in those cases the first
one is preferred as it won't matter.

Thanks for the review and explanation!

Tamar

>
> Thanks,
> Richard
>
> >
> > Cheers,
> > Tamar
> >
> >> > On high register pressure we also use LRA's costing to prefer not
> >> > to use the alternative and instead just use the tie, as this is
> >> > preferable to a reload.
> >> >
> >> > Concretely this patch series does:
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p3.b, all
> >> >         cmplo   p0.h, p3/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > Testcases for the changes are in the last patch of the series.
> >> >
> >> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > ---
> >> >
> >> > --