Hi!

This patch is a clear improvement :-)

On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
> As mentioned in PR106550, since pli could support 34bits immediate, we could
> use less instructions(3insn would be ok) to build 64bits constant with pli.

> For example, for constant 0x020805006106003, we could generate it with:
> asm code1:
> pli 9,101736451 (0x6106003)
> sldi 9,9,32
> paddi 9,9, 2130000 (0x0208050)

3 insns, 2 insns dependent on the previous, each.

> or asm code2:
> pli 10, 2130000
> pli 9, 101736451
> rldimi 9, 10, 32, 0

3 insns, 1 insn dependent on both others.

> Testing with simple cases as below, run them a lot of times:
> f1.c
> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
> {
>   *arg = 0x2351847027482577;
> }
> 5insns: base
> pli+sldi+paddi: similar -0.08%
> pli+pli+rldimi: faster +0.66%

This mostly tests how well this micro-benchmark is scheduled.  More time
is spent in the looping and function calls (not shown)!

> f2.c
> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
> {
>   *arg = 0x2351847027482577;
>   *arg2 = 0x3257845024384680;
>   *arg3 = 0x1245abcef9240dec;
> }
> 5nisns: base
> pli+sldi+paddi: faster +1.35%
> pli+pli+rldimi: faster +5.49%
> 
> f2.c would be more meaningful.  Because 'sched passes' are effective for
> f2.c, but 'scheds' do less thing for f1.c.

It still is a too small example to mean much without looking at a
pipeview, or at the very least perf.  But the results show a solid
improvement as expected ;-)

> gcc/ChangeLog:
>       PR target/106550
>       * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
>       constant building.

"Use pli." ?

> gcc/testsuite/ChangeLog:
>       PR target/106550
>       * gcc.target/powerpc/pr106550.c: New test.

> +  else if (TARGET_PREFIXED)
> +    {
> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */

But not just 9 and 10.  Use A and B or X and Y or H and L or something
like that?

The comment goes...

> +      if (can_create_pseudo_p ())
> +     {

... here.

> +       temp = gen_reg_rtx (DImode);
> +       rtx temp1 = gen_reg_rtx (DImode);
> +       emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
> +       emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
> +
> +       emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
> +                                        GEN_INT (0xffffffff)));
> +     }
> +

No blank line here please.

> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
> +      else
> +     {

The comment goes here, in the block it refers to.  Comments for a block
are the first thing *in* the block.

> +       emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
> +
> +       emit_move_insn (copy_rtx (dest),
> +                       gen_rtx_ASHIFT (DImode, copy_rtx (dest),
> +                                       GEN_INT (32)));
> +
> +       bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;

There should be a test that we so the right thing (or *a* right thing,
anyway; a working thing; but hopefully a reasonably fast thing) for
!can_use_paddi.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
> @@ -0,0 +1,14 @@
> +/* PR target/106550 */
> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
> +
> +void
> +foo (unsigned long long *a)
> +{
> +  *a++ = 0x020805006106003;
> +  *a++ = 0x2351847027482577;  
> +}
> +
> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */

Also test the expected insns separately please?  The std's (just with
\mstd so it will catch all variations as well), the blr, the pli's and
the rldimi etc.?

We also should test all special cases as well.  Especially those that do
not happen all over the place, we will notice something is broken there
easy enough.  But unlikely cases can take years to show up.

Okay for trunk with the formatting fixed.  Thank you!


Segher

Reply via email to