Am 30.08.24 um 14:46 schrieb Richard Biener:
On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote:

There are cases, where opportunities to use POST_INC addressing
only occur very late in the compilation process.  Take for example
the following function from AVR-LibC's qsort:

void swapfunc (char *a, char *b, int n)
{
      do
      {
          char t = *a;
          *a++ = *b;
          *b++ = t;
      } while (--n > 0);
}


which  -mmcu=avrtiny -S -Os -dp  compiles to:

swapfunc:
         push r28                 ;  72  [c=4 l=1]  pushqi1/0
         push r29                 ;  73  [c=4 l=1]  pushqi1/0
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
         mov r26,r24      ;  66  [c=4 l=1]  movqi_insn/0
         mov r27,r25      ;  67  [c=4 l=1]  movqi_insn/0
         mov r30,r22      ;  68  [c=4 l=1]  movqi_insn/0
         mov r31,r23      ;  69  [c=4 l=1]  movqi_insn/0
         mov r22,r20      ;  70  [c=4 l=1]  movqi_insn/0
         mov r23,r21      ;  71  [c=4 l=1]  movqi_insn/0
.L2:
         ld r20,X                 ;  55  [c=4 l=1]  movqi_insn/3
         ld r21,Z                 ;  57  [c=4 l=1]  movqi_insn/3
         st X,r21                 ;  58  [c=4 l=1]  movqi_insn/2
         subi r26,-1      ;  59  [c=4 l=2]  *addhi3_clobber/0
         sbci r27,-1
         st Z,r20                 ;  61  [c=4 l=1]  movqi_insn/2
         subi r30,-1      ;  62  [c=4 l=2]  *addhi3_clobber/0
         sbci r31,-1
         subi r22, 1      ;  81  [c=4 l=2]  *add.for.cczn.hi/0
         sbci r23, 0
         breq .+2         ;  82  [c=8 l=2]  branch_ZN
         brpl .L2
/* epilogue start */
         pop r29          ;  76  [c=4 l=1]  popqi
         pop r28          ;  77  [c=4 l=1]  popqi
         ret              ;  78  [c=0 l=1]  return_from_epilogue

Insn 56+57 and insns 61+62 are post-inc stores.  They are not recognized
because the code prior to cprop_hardreg is a bit of a mess that moves
addresses back and forth (including Y).  Only after cprop_hardreg the
code is simple enough so post-inc can be detected by avr-fuse-add.

Hence this patch runs a 2nd instance of that pass late after
cprop_hardreg (the 1st instance runs prior to RTL peephole).

It renames avr_split_tiny_move to avr_split_fake_addressing_move
because that function also splits some insns on non-avrtiny.

The patch removes a define_split that's no more needed because
such splits are performed by avr_split_fake_addressing_move.

Passed without new regressions.  Ok for trunk?

Johann

p.s. post-inc etc. optimizations is basically non-existent in GCC.
One reasons seems to be SSA because each SSA variable gets its own
register, which results in a jungle of required registers which
even pass auto-inc-dec cannot penetrate...

Take for example the following code, that would require 12 instructions
as indicated by the comments:

void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn)
{
      // Set Z (R30) to bb (1 MOVW or 2 MOVs)
      // Set X (R26) to aa (1 MOVW or 2 MOVs)
      do
      {
          uint8_t sum = 0;
          sum += *bb++; // 1 instruction: POST_INC load
          sum += *bb++; // 2 instructions: POST_INC load + add
          sum += *bb++; // 2 instructions: POST_INC load + add
          sum += *bb++; // 2 instructions: POST_INC load + add
          *aa++ = sum;  // 1 instruction: POST_INC store
      } while (--nn);   // 2 instructions: dec + branch
}

I think the reason here is weird behavior with __flash and how IVOPTS
tries to enable auto-incdec:

   _30 = bb_2 + 1;
   _10 = MEM[(const <address-space-1> uint8_t *)_30];
   _29 = bb_2 + 2;
   _12 = MEM[(const <address-space-1> uint8_t *)_29];
   _19 = _10 + _12;
   sum_13 = _19 + _9;
   bb_14 = bb_2 + 4;
   _28 = bb_14 + 65535;
   _15 = MEM[(const <address-space-1> uint8_t *)_28];

without __flash you get

   _10 = MEM[(const uint8_t *)_27 + 1B];
   sum_11 = _9 + _10;
   _12 = MEM[(const uint8_t *)_27 + 2B];
   sum_13 = sum_11 + _12;
   _15 = MEM[(const uint8_t *)_27 + 3B];

which of ocurse is a problem for pre/post-inc addressing as well
but one that I think is solved (well, you can hope...).

Can you open a bugreport with this testcase?

https://gcc.gnu.org/PR116542

Without __flash, the code uses Z, Z+1, Z+2 and Z+3 as addresses,
but still uses one register for access and one register to increment
the address.

Then there are these crazy computation like:

start_address - current_address + nn == 0

as condition for loop termination instead of just dec + branch
of nn.

Johann


But  -mmcu=avr4 -Os -S -dp  compiles this to madness that
requires more than a dozen registers and computes each
intermediate in its own 16-bit register, coming out with
code that requires 34 instructions instead of 12.

--

AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg.

gcc/
         * config/avr/avr-protos.h (avr_split_tiny_move): Rename to
         avr_split_fake_addressing_move.
         * config/avr/avr-passes.cc: Same.
         (avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP.
         (avr_pass_fuse_add) <clone>: Override.
         * config/avr/avr-passes.def (avr_pass_fuse_add): Run again
         after pass_cprop_hardreg.
         * config/avr/avr.md (split-lpmx): Remove a define_split.  Such
         splits are performed by avr_split_fake_addressing_move.

Reply via email to