On Fri, Aug 30, 2024 at 3:28 PM Georg-Johann Lay <a...@gjlay.de> wrote: > > Am 30.08.24 um 14:46 schrieb Richard Biener: > > On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote: > >> > >> There are cases, where opportunities to use POST_INC addressing > >> only occur very late in the compilation process. Take for example > >> the following function from AVR-LibC's qsort: > >> > >> void swapfunc (char *a, char *b, int n) > >> { > >> do > >> { > >> char t = *a; > >> *a++ = *b; > >> *b++ = t; > >> } while (--n > 0); > >> } > >> > >> > >> which -mmcu=avrtiny -S -Os -dp compiles to: > >> > >> swapfunc: > >> push r28 ; 72 [c=4 l=1] pushqi1/0 > >> push r29 ; 73 [c=4 l=1] pushqi1/0 > >> /* prologue: function */ > >> /* frame size = 0 */ > >> /* stack size = 2 */ > >> mov r26,r24 ; 66 [c=4 l=1] movqi_insn/0 > >> mov r27,r25 ; 67 [c=4 l=1] movqi_insn/0 > >> mov r30,r22 ; 68 [c=4 l=1] movqi_insn/0 > >> mov r31,r23 ; 69 [c=4 l=1] movqi_insn/0 > >> mov r22,r20 ; 70 [c=4 l=1] movqi_insn/0 > >> mov r23,r21 ; 71 [c=4 l=1] movqi_insn/0 > >> .L2: > >> ld r20,X ; 55 [c=4 l=1] movqi_insn/3 > >> ld r21,Z ; 57 [c=4 l=1] movqi_insn/3 > >> st X,r21 ; 58 [c=4 l=1] movqi_insn/2 > >> subi r26,-1 ; 59 [c=4 l=2] *addhi3_clobber/0 > >> sbci r27,-1 > >> st Z,r20 ; 61 [c=4 l=1] movqi_insn/2 > >> subi r30,-1 ; 62 [c=4 l=2] *addhi3_clobber/0 > >> sbci r31,-1 > >> subi r22, 1 ; 81 [c=4 l=2] *add.for.cczn.hi/0 > >> sbci r23, 0 > >> breq .+2 ; 82 [c=8 l=2] branch_ZN > >> brpl .L2 > >> /* epilogue start */ > >> pop r29 ; 76 [c=4 l=1] popqi > >> pop r28 ; 77 [c=4 l=1] popqi > >> ret ; 78 [c=0 l=1] return_from_epilogue > >> > >> Insn 56+57 and insns 61+62 are post-inc stores. They are not recognized > >> because the code prior to cprop_hardreg is a bit of a mess that moves > >> addresses back and forth (including Y). Only after cprop_hardreg the > >> code is simple enough so post-inc can be detected by avr-fuse-add. > >> > >> Hence this patch runs a 2nd instance of that pass late after > >> cprop_hardreg (the 1st instance runs prior to RTL peephole). > >> > >> It renames avr_split_tiny_move to avr_split_fake_addressing_move > >> because that function also splits some insns on non-avrtiny. > >> > >> The patch removes a define_split that's no more needed because > >> such splits are performed by avr_split_fake_addressing_move. > >> > >> Passed without new regressions. Ok for trunk? > >> > >> Johann > >> > >> p.s. post-inc etc. optimizations is basically non-existent in GCC. > >> One reasons seems to be SSA because each SSA variable gets its own > >> register, which results in a jungle of required registers which > >> even pass auto-inc-dec cannot penetrate... > >> > >> Take for example the following code, that would require 12 instructions > >> as indicated by the comments: > >> > >> void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn) > >> { > >> // Set Z (R30) to bb (1 MOVW or 2 MOVs) > >> // Set X (R26) to aa (1 MOVW or 2 MOVs) > >> do > >> { > >> uint8_t sum = 0; > >> sum += *bb++; // 1 instruction: POST_INC load > >> sum += *bb++; // 2 instructions: POST_INC load + add > >> sum += *bb++; // 2 instructions: POST_INC load + add > >> sum += *bb++; // 2 instructions: POST_INC load + add > >> *aa++ = sum; // 1 instruction: POST_INC store > >> } while (--nn); // 2 instructions: dec + branch > >> } > > > > I think the reason here is weird behavior with __flash and how IVOPTS > > tries to enable auto-incdec: > > > > _30 = bb_2 + 1; > > _10 = MEM[(const <address-space-1> uint8_t *)_30]; > > _29 = bb_2 + 2; > > _12 = MEM[(const <address-space-1> uint8_t *)_29]; > > _19 = _10 + _12; > > sum_13 = _19 + _9; > > bb_14 = bb_2 + 4; > > _28 = bb_14 + 65535; > > _15 = MEM[(const <address-space-1> uint8_t *)_28]; > > > > without __flash you get > > > > _10 = MEM[(const uint8_t *)_27 + 1B]; > > sum_11 = _9 + _10; > > _12 = MEM[(const uint8_t *)_27 + 2B]; > > sum_13 = sum_11 + _12; > > _15 = MEM[(const uint8_t *)_27 + 3B]; > > > > which of ocurse is a problem for pre/post-inc addressing as well > > but one that I think is solved (well, you can hope...). > > > > Can you open a bugreport with this testcase? > > https://gcc.gnu.org/PR116542
Thanks. > Without __flash, the code uses Z, Z+1, Z+2 and Z+3 as addresses, > but still uses one register for access and one register to increment > the address. > > Then there are these crazy computation like: > > start_address - current_address + nn == 0 > > as condition for loop termination instead of just dec + branch > of nn. That's likely due to costs as estimated by IVOPTs and the backend. Richard. > Johann > > > >> But -mmcu=avr4 -Os -S -dp compiles this to madness that > >> requires more than a dozen registers and computes each > >> intermediate in its own 16-bit register, coming out with > >> code that requires 34 instructions instead of 12. > >> > >> -- > >> > >> AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg. > >> > >> gcc/ > >> * config/avr/avr-protos.h (avr_split_tiny_move): Rename to > >> avr_split_fake_addressing_move. > >> * config/avr/avr-passes.cc: Same. > >> (avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP. > >> (avr_pass_fuse_add) <clone>: Override. > >> * config/avr/avr-passes.def (avr_pass_fuse_add): Run again > >> after pass_cprop_hardreg. > >> * config/avr/avr.md (split-lpmx): Remove a define_split. Such > >> splits are performed by avr_split_fake_addressing_move.