Hi,

Segher Boessenkool <seg...@kernel.crashing.org> writes:

> Hi!
>
> This patch is a clear improvement :-)
>
> On Thu, Sep 01, 2022 at 11:24:00AM +0800, Jiufu Guo wrote:
>> As mentioned in PR106550, since pli could support 34bits immediate, we could
>> use less instructions(3insn would be ok) to build 64bits constant with pli.
>
>> For example, for constant 0x020805006106003, we could generate it with:
>> asm code1:
>> pli 9,101736451 (0x6106003)
>> sldi 9,9,32
>> paddi 9,9, 2130000 (0x0208050)
>
> 3 insns, 2 insns dependent on the previous, each.
Yeap.
>
>> or asm code2:
>> pli 10, 2130000
>> pli 9, 101736451
>> rldimi 9, 10, 32, 0
>
> 3 insns, 1 insn dependent on both others.
Yes.
>
>> Testing with simple cases as below, run them a lot of times:
>> f1.c
>> long __attribute__ ((noinline)) foo (long *arg,long *,long*)
>> {
>>   *arg = 0x2351847027482577;
>> }
>> 5insns: base
>> pli+sldi+paddi: similar -0.08%
>> pli+pli+rldimi: faster +0.66%
>
> This mostly tests how well this micro-benchmark is scheduled.  More time
> is spent in the looping and function calls (not shown)!
>
>> f2.c
>> long __attribute__ ((noinline)) foo (long *arg, long *arg2, long *arg3)
>> {
>>   *arg = 0x2351847027482577;
>>   *arg2 = 0x3257845024384680;
>>   *arg3 = 0x1245abcef9240dec;
>> }
>> 5nisns: base
>> pli+sldi+paddi: faster +1.35%
>> pli+pli+rldimi: faster +5.49%
>> 
>> f2.c would be more meaningful.  Because 'sched passes' are effective for
>> f2.c, but 'scheds' do less thing for f1.c.
>
> It still is a too small example to mean much without looking at a
> pipeview, or at the very least perf.  But the results show a solid
> improvement as expected ;-)
Right, checking how the 'cycles' are using on each instructions would be
more meaningful to demonstrate how the runtime is changing.
>
>> gcc/ChangeLog:
>>      PR target/106550
>>      * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add 'pli' for
>>      constant building.
>
> "Use pli." ?
Thanks, will update.
>
>> gcc/testsuite/ChangeLog:
>>      PR target/106550
>>      * gcc.target/powerpc/pr106550.c: New test.
>
>> +  else if (TARGET_PREFIXED)
>> +    {
>> +      /* pli 9,high32 + pli 10,low32 + rldimi 9,10,32,0.  */
>
> But not just 9 and 10.  Use A and B or X and Y or H and L or something
> like that?
OK,  will updata accordingly.
>
> The comment goes...
>
>> +      if (can_create_pseudo_p ())
>> +    {
>
> ... here.
>
>> +      temp = gen_reg_rtx (DImode);
>> +      rtx temp1 = gen_reg_rtx (DImode);
>> +      emit_move_insn (copy_rtx (temp), GEN_INT ((ud4 << 16) | ud3));
>> +      emit_move_insn (copy_rtx (temp1), GEN_INT ((ud2 << 16) | ud1));
>> +
>> +      emit_insn (gen_rotldi3_insert_3 (dest, temp, GEN_INT (32), temp1,
>> +                                       GEN_INT (0xffffffff)));
>> +    }
>> +
>
> No blank line here please.
>
>> +      /* pli 9,high32 + sldi 9,32 + paddi 9,9,low32.  */
>> +      else
>> +    {
>
> The comment goes here, in the block it refers to.  Comments for a block
> are the first thing *in* the block.
OK, great! I like the format you sugguested here :-)
>
>> +      emit_move_insn (copy_rtx (dest), GEN_INT ((ud4 << 16) | ud3));
>> +
>> +      emit_move_insn (copy_rtx (dest),
>> +                      gen_rtx_ASHIFT (DImode, copy_rtx (dest),
>> +                                      GEN_INT (32)));
>> +
>> +      bool can_use_paddi = REGNO (dest) != FIRST_GPR_REGNO;
>
> There should be a test that we so the right thing (or *a* right thing,
> anyway; a working thing; but hopefully a reasonably fast thing) for
> !can_use_paddi.
To catch this test point, we need let the splitter run after RA,
and register 0 happen to be the dest of an assignment.
Oh, below case would be useful for this test point:

/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */
/* force the constant splitter run after RA: -fdisable-rtl-split1
   a few assignments to make sure r0 is allocated as dest of an assign.  */

void
foo (unsigned long long *a)
{
  *a++ = 0x020805006106003;
  *a++ = 0x2351847027482587;
  *a++ = 0x22513478874a2578;
  *a++ = 0x02180570670b003;
  *a++ = 0x2311847029488587;
  *a++ = 0x335184b02748757f;
  *a++ = 0x720805006096003;
  *a++ = 0x23a18b70a74e257e;
  *a++ = 0x2a518a70a74a2567;
  *a++ = 0x5208a5da0606a03;
  *a++ = 0x1391a47a2749257a;
  *a++ = 0x235a847027488576;
  *a++ = 0x23a1847027482677;  
}

/* { dg-final { scan-assembler-times {\moris\M} 1 } } */
/* { dg-final { scan-assembler-times {\mori\M} 1 } } */

I will add this test case in patch.
Is this ok?  Any sugguestions?
                                           
>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c
>> @@ -0,0 +1,14 @@
>> +/* PR target/106550 */
>> +/* { dg-options "-O2 -std=c99 -mdejagnu-cpu=power10" } */
>> +
>> +void
>> +foo (unsigned long long *a)
>> +{
>> +  *a++ = 0x020805006106003;
>> +  *a++ = 0x2351847027482577;  
>> +}
>> +
>> +/* 3 insns for each constant: pli+sldi+paddi or pli+pli+rldimi.
>> +   And 3 additional insns: std+std+blr: 9 insns totally.  */
>> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
>
> Also test the expected insns separately please?  The std's (just with
> \mstd so it will catch all variations as well), the blr, the pli's and
> the rldimi etc.?
The reason of using "(?n)^\s+[a-z]" is to keep this test case pass no
matter the splitter running before or after RA.

Yes, using real instructions to check would be better.
Will update the case to check real instructions 'pli' and 'rldimi'.

>
> We also should test all special cases as well.  Especially those that do
> not happen all over the place, we will notice something is broken there
> easy enough.  But unlikely cases can take years to show up.
Understand :)

BR,
Jeff(Jiufu)

>
> Okay for trunk with the formatting fixed.  Thank you!
>
>
> Segher

Reply via email to