On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote:
>
> There are cases, where opportunities to use POST_INC addressing
> only occur very late in the compilation process.  Take for example
> the following function from AVR-LibC's qsort:
>
> void swapfunc (char *a, char *b, int n)
> {
>      do
>      {
>          char t = *a;
>          *a++ = *b;
>          *b++ = t;
>      } while (--n > 0);
> }
>
>
> which  -mmcu=avrtiny -S -Os -dp  compiles to:
>
> swapfunc:
>         push r28                 ;  72  [c=4 l=1]  pushqi1/0
>         push r29                 ;  73  [c=4 l=1]  pushqi1/0
> /* prologue: function */
> /* frame size = 0 */
> /* stack size = 2 */
>         mov r26,r24      ;  66  [c=4 l=1]  movqi_insn/0
>         mov r27,r25      ;  67  [c=4 l=1]  movqi_insn/0
>         mov r30,r22      ;  68  [c=4 l=1]  movqi_insn/0
>         mov r31,r23      ;  69  [c=4 l=1]  movqi_insn/0
>         mov r22,r20      ;  70  [c=4 l=1]  movqi_insn/0
>         mov r23,r21      ;  71  [c=4 l=1]  movqi_insn/0
> .L2:
>         ld r20,X                 ;  55  [c=4 l=1]  movqi_insn/3
>         ld r21,Z                 ;  57  [c=4 l=1]  movqi_insn/3
>         st X,r21                 ;  58  [c=4 l=1]  movqi_insn/2
>         subi r26,-1      ;  59  [c=4 l=2]  *addhi3_clobber/0
>         sbci r27,-1
>         st Z,r20                 ;  61  [c=4 l=1]  movqi_insn/2
>         subi r30,-1      ;  62  [c=4 l=2]  *addhi3_clobber/0
>         sbci r31,-1
>         subi r22, 1      ;  81  [c=4 l=2]  *add.for.cczn.hi/0
>         sbci r23, 0
>         breq .+2         ;  82  [c=8 l=2]  branch_ZN
>         brpl .L2
> /* epilogue start */
>         pop r29          ;  76  [c=4 l=1]  popqi
>         pop r28          ;  77  [c=4 l=1]  popqi
>         ret              ;  78  [c=0 l=1]  return_from_epilogue
>
> Insn 56+57 and insns 61+62 are post-inc stores.  They are not recognized
> because the code prior to cprop_hardreg is a bit of a mess that moves
> addresses back and forth (including Y).  Only after cprop_hardreg the
> code is simple enough so post-inc can be detected by avr-fuse-add.
>
> Hence this patch runs a 2nd instance of that pass late after
> cprop_hardreg (the 1st instance runs prior to RTL peephole).
>
> It renames avr_split_tiny_move to avr_split_fake_addressing_move
> because that function also splits some insns on non-avrtiny.
>
> The patch removes a define_split that's no more needed because
> such splits are performed by avr_split_fake_addressing_move.
>
> Passed without new regressions.  Ok for trunk?
>
> Johann
>
> p.s. post-inc etc. optimizations is basically non-existent in GCC.
> One reasons seems to be SSA because each SSA variable gets its own
> register, which results in a jungle of required registers which
> even pass auto-inc-dec cannot penetrate...
>
> Take for example the following code, that would require 12 instructions
> as indicated by the comments:
>
> void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn)
> {
>      // Set Z (R30) to bb (1 MOVW or 2 MOVs)
>      // Set X (R26) to aa (1 MOVW or 2 MOVs)
>      do
>      {
>          uint8_t sum = 0;
>          sum += *bb++; // 1 instruction: POST_INC load
>          sum += *bb++; // 2 instructions: POST_INC load + add
>          sum += *bb++; // 2 instructions: POST_INC load + add
>          sum += *bb++; // 2 instructions: POST_INC load + add
>          *aa++ = sum;  // 1 instruction: POST_INC store
>      } while (--nn);   // 2 instructions: dec + branch
> }

I think the reason here is weird behavior with __flash and how IVOPTS
tries to enable auto-incdec:

  _30 = bb_2 + 1;
  _10 = MEM[(const <address-space-1> uint8_t *)_30];
  _29 = bb_2 + 2;
  _12 = MEM[(const <address-space-1> uint8_t *)_29];
  _19 = _10 + _12;
  sum_13 = _19 + _9;
  bb_14 = bb_2 + 4;
  _28 = bb_14 + 65535;
  _15 = MEM[(const <address-space-1> uint8_t *)_28];

without __flash you get

  _10 = MEM[(const uint8_t *)_27 + 1B];
  sum_11 = _9 + _10;
  _12 = MEM[(const uint8_t *)_27 + 2B];
  sum_13 = sum_11 + _12;
  _15 = MEM[(const uint8_t *)_27 + 3B];

which of ocurse is a problem for pre/post-inc addressing as well
but one that I think is solved (well, you can hope...).

Can you open a bugreport with this testcase?

> But  -mmcu=avr4 -Os -S -dp  compiles this to madness that
> requires more than a dozen registers and computes each
> intermediate in its own 16-bit register, coming out with
> code that requires 34 instructions instead of 12.
>
> --
>
> AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg.
>
> gcc/
>         * config/avr/avr-protos.h (avr_split_tiny_move): Rename to
>         avr_split_fake_addressing_move.
>         * config/avr/avr-passes.cc: Same.
>         (avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP.
>         (avr_pass_fuse_add) <clone>: Override.
>         * config/avr/avr-passes.def (avr_pass_fuse_add): Run again
>         after pass_cprop_hardreg.
>         * config/avr/avr.md (split-lpmx): Remove a define_split.  Such
>         splits are performed by avr_split_fake_addressing_move.

Reply via email to