Re: [patch,avr] Run pass avr-fuse-add a second time

Richard Biener Fri, 30 Aug 2024 06:31:36 -0700

On Fri, Aug 30, 2024 at 3:28 PM Georg-Johann Lay <a...@gjlay.de> wrote:
>
> Am 30.08.24 um 14:46 schrieb Richard Biener:
> > On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote:
> >>
> >> There are cases, where opportunities to use POST_INC addressing
> >> only occur very late in the compilation process.  Take for example
> >> the following function from AVR-LibC's qsort:
> >>
> >> void swapfunc (char *a, char *b, int n)
> >> {
> >>       do
> >>       {
> >>           char t = *a;
> >>           *a++ = *b;
> >>           *b++ = t;
> >>       } while (--n > 0);
> >> }
> >>
> >>
> >> which  -mmcu=avrtiny -S -Os -dp  compiles to:
> >>
> >> swapfunc:
> >>          push r28                 ;  72  [c=4 l=1]  pushqi1/0
> >>          push r29                 ;  73  [c=4 l=1]  pushqi1/0
> >> /* prologue: function */
> >> /* frame size = 0 */
> >> /* stack size = 2 */
> >>          mov r26,r24      ;  66  [c=4 l=1]  movqi_insn/0
> >>          mov r27,r25      ;  67  [c=4 l=1]  movqi_insn/0
> >>          mov r30,r22      ;  68  [c=4 l=1]  movqi_insn/0
> >>          mov r31,r23      ;  69  [c=4 l=1]  movqi_insn/0
> >>          mov r22,r20      ;  70  [c=4 l=1]  movqi_insn/0
> >>          mov r23,r21      ;  71  [c=4 l=1]  movqi_insn/0
> >> .L2:
> >>          ld r20,X                 ;  55  [c=4 l=1]  movqi_insn/3
> >>          ld r21,Z                 ;  57  [c=4 l=1]  movqi_insn/3
> >>          st X,r21                 ;  58  [c=4 l=1]  movqi_insn/2
> >>          subi r26,-1      ;  59  [c=4 l=2]  *addhi3_clobber/0
> >>          sbci r27,-1
> >>          st Z,r20                 ;  61  [c=4 l=1]  movqi_insn/2
> >>          subi r30,-1      ;  62  [c=4 l=2]  *addhi3_clobber/0
> >>          sbci r31,-1
> >>          subi r22, 1      ;  81  [c=4 l=2]  *add.for.cczn.hi/0
> >>          sbci r23, 0
> >>          breq .+2         ;  82  [c=8 l=2]  branch_ZN
> >>          brpl .L2
> >> /* epilogue start */
> >>          pop r29          ;  76  [c=4 l=1]  popqi
> >>          pop r28          ;  77  [c=4 l=1]  popqi
> >>          ret              ;  78  [c=0 l=1]  return_from_epilogue
> >>
> >> Insn 56+57 and insns 61+62 are post-inc stores.  They are not recognized
> >> because the code prior to cprop_hardreg is a bit of a mess that moves
> >> addresses back and forth (including Y).  Only after cprop_hardreg the
> >> code is simple enough so post-inc can be detected by avr-fuse-add.
> >>
> >> Hence this patch runs a 2nd instance of that pass late after
> >> cprop_hardreg (the 1st instance runs prior to RTL peephole).
> >>
> >> It renames avr_split_tiny_move to avr_split_fake_addressing_move
> >> because that function also splits some insns on non-avrtiny.
> >>
> >> The patch removes a define_split that's no more needed because
> >> such splits are performed by avr_split_fake_addressing_move.
> >>
> >> Passed without new regressions.  Ok for trunk?
> >>
> >> Johann
> >>
> >> p.s. post-inc etc. optimizations is basically non-existent in GCC.
> >> One reasons seems to be SSA because each SSA variable gets its own
> >> register, which results in a jungle of required registers which
> >> even pass auto-inc-dec cannot penetrate...
> >>
> >> Take for example the following code, that would require 12 instructions
> >> as indicated by the comments:
> >>
> >> void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn)
> >> {
> >>       // Set Z (R30) to bb (1 MOVW or 2 MOVs)
> >>       // Set X (R26) to aa (1 MOVW or 2 MOVs)
> >>       do
> >>       {
> >>           uint8_t sum = 0;
> >>           sum += *bb++; // 1 instruction: POST_INC load
> >>           sum += *bb++; // 2 instructions: POST_INC load + add
> >>           sum += *bb++; // 2 instructions: POST_INC load + add
> >>           sum += *bb++; // 2 instructions: POST_INC load + add
> >>           *aa++ = sum;  // 1 instruction: POST_INC store
> >>       } while (--nn);   // 2 instructions: dec + branch
> >> }
> >
> > I think the reason here is weird behavior with __flash and how IVOPTS
> > tries to enable auto-incdec:
> >
> >    _30 = bb_2 + 1;
> >    _10 = MEM[(const <address-space-1> uint8_t *)_30];
> >    _29 = bb_2 + 2;
> >    _12 = MEM[(const <address-space-1> uint8_t *)_29];
> >    _19 = _10 + _12;
> >    sum_13 = _19 + _9;
> >    bb_14 = bb_2 + 4;
> >    _28 = bb_14 + 65535;
> >    _15 = MEM[(const <address-space-1> uint8_t *)_28];
> >
> > without __flash you get
> >
> >    _10 = MEM[(const uint8_t *)_27 + 1B];
> >    sum_11 = _9 + _10;
> >    _12 = MEM[(const uint8_t *)_27 + 2B];
> >    sum_13 = sum_11 + _12;
> >    _15 = MEM[(const uint8_t *)_27 + 3B];
> >
> > which of ocurse is a problem for pre/post-inc addressing as well
> > but one that I think is solved (well, you can hope...).
> >
> > Can you open a bugreport with this testcase?
>
> https://gcc.gnu.org/PR116542


Thanks.

> Without __flash, the code uses Z, Z+1, Z+2 and Z+3 as addresses,
> but still uses one register for access and one register to increment
> the address.
>
> Then there are these crazy computation like:
>
> start_address - current_address + nn == 0
>
> as condition for loop termination instead of just dec + branch
> of nn.

That's likely due to costs as estimated by IVOPTs and the backend.

Richard.

> Johann
>
>
> >> But  -mmcu=avr4 -Os -S -dp  compiles this to madness that
> >> requires more than a dozen registers and computes each
> >> intermediate in its own 16-bit register, coming out with
> >> code that requires 34 instructions instead of 12.
> >>
> >> --
> >>
> >> AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg.
> >>
> >> gcc/
> >>          * config/avr/avr-protos.h (avr_split_tiny_move): Rename to
> >>          avr_split_fake_addressing_move.
> >>          * config/avr/avr-passes.cc: Same.
> >>          (avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP.
> >>          (avr_pass_fuse_add) <clone>: Override.
> >>          * config/avr/avr-passes.def (avr_pass_fuse_add): Run again
> >>          after pass_cprop_hardreg.
> >>          * config/avr/avr.md (split-lpmx): Remove a define_split.  Such
> >>          splits are performed by avr_split_fake_addressing_move.

Re: [patch,avr] Run pass avr-fuse-add a second time

Reply via email to