Re: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Roger,

The patch as it is passed validation, and it is in general OK.
Although it doesn't address the elephant in the room, namely the
output_shift function, it is a welcome cleanup.

I would like you to split the patch in two: one which deals with the
improvements to shifts in the absence of a barrel shifter, and one
which addresses the default instruction length, as they can be seen
as separate work.  Please feel free to commit the resulting patches
to mainline.

Thank you for your contribution,
Claudiu

On Thu, Sep 28, 2023 at 2:27 PM Roger Sayle wrote:
>
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the
> GNU tools Cauldron in Cambridge.
>
> This patch is the first in a series to improve SImode and DImode
> shifts and rotates in the ARC backend.  This first piece splits
> SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine
> and before reload, in the split1 pass, as suggested by the FIXME
> comment above output_shift in arc.cc.
> [...]
Re: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Roger,

It is not necessary to do any mods to your patch; I've just answered
the questions which you asked me.  The adds are faster for the ARC
CPUs which are still in production, and I suppose we can leverage the
LP instruction use with DBNZ instructions for implementing loops.
I'll come back to you asap, after I've got the nightly results :)

Thank you,
Claudiu

On Tue, Oct 3, 2023 at 6:34 PM Roger Sayle wrote:
>
> Hi Claudiu,
> Thanks for the answers to my technical questions.
> If you'd prefer to update arc.md's add3 pattern first, I'm happy to
> update/revise my patch based on this and your feedback, for example
> preferring add over asl_s (or controlling this choice with -Os).
>
> Thanks again,
> Roger
> [...]
RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu,

Thanks for the answers to my technical questions.  If you'd prefer to
update arc.md's add3 pattern first, I'm happy to update/revise my
patch based on this and your feedback, for example preferring add over
asl_s (or controlling this choice with -Os).

Thanks again,
Roger
--

> -----Original Message-----
> From: Claudiu Zissulescu
> Sent: 03 October 2023 15:26
> To: Roger Sayle; gcc-patches@gcc.gnu.org
> Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
>
> Hi Roger,
>
> It was nice to meet you too.
>
> Thank you for looking into the ARC's non-barrel-shifter
> configurations.  I will dive into your patch asap, but before
> starting, here are a few of my comments:
> [...]
RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Roger,

It was nice to meet you too.

Thank you for looking into the ARC's non-barrel-shifter
configurations.  I will dive into your patch asap, but before
starting, here are a few of my comments:

-----Original Message-----
From: Roger Sayle
Sent: Thursday, September 28, 2023 2:27 PM
To: gcc-patches@gcc.gnu.org
Cc: Claudiu Zissulescu
Subject: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.

Hi Claudiu,
It was great meeting up with you and the Synopsys ARC team at the
GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode
shifts and rotates in the ARC backend.  This first piece splits
SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine and
before reload, in the split1 pass, as suggested by the FIXME comment
above output_shift in arc.cc.  To do this I've copied the
implementation of the x86_pre_reload_split function from the i386
backend, and renamed it arc_pre_reload_split.

Although the actual implementations of the shifts remain the same (as
in output_shift), having them as explicit instructions in the RTL
stream allows better scheduling and use of compact forms when
available.  The benefits can be seen in the two short examples below.

For the function:

unsigned int foo(unsigned int x, unsigned int y) {
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:

foo:    add r1,r1,r1
        add r1,r1,r1
        j_s.d [blink]
        mov_s r0,r1     ;4

[CZI] The move shouldn't be generated, indeed.  The use of ADDs is
slightly beneficial for older ARCv1 arches.

and with this patch now generates:

foo:    asl_s r0,r1
        j_s.d [blink]
        asl_s r0,r0

[CZI] Nice.  This new sequence is as fast as we can get for our ARCv2
cpus.

Notice the original (from shift_si3's output_shift) requires the
shift sequence to be monolithic, with the same destination register
as the source (requiring an extra mov_s).  The new version can
eliminate this move, and schedule the second asl in the branch delay
slot of the return.
For the function:

int x,y,z;

void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:

bar:    push_s r13
        ld.as r12,[gp,@x@sda]   ;23
        ld.as r3,[gp,@y@sda]    ;23
        mov r2,0
        add3 r12,r2,r12
        mov r2,0
        add3 r3,r2,r3
        ld.as r2,[gp,@z@sda]    ;23
        st.as r12,[gp,@x@sda]   ;26
        mov r13,0
        add3 r2,r13,r2
        st.as r3,[gp,@y@sda]    ;26
        st.as r2,[gp,@z@sda]    ;26
        j_s.d [blink]
        pop_s r13

where each shift by 3 uses ARC's add3 instruction, which is similar
to x86's lea, implementing x = (y<<3) + z, but requires the value
zero to be placed in a temporary register "z".  Splitting this before
reload allows these pseudos to be shared/reused.  With this patch, we
get:

bar:    ld.as r2,[gp,@x@sda]    ;23
        mov_s r3,0      ;3
        add3 r2,r3,r2
        ld.as r3,[gp,@y@sda]    ;23
        st.as r2,[gp,@x@sda]    ;26
        ld.as r2,[gp,@z@sda]    ;23
        mov_s r12,0     ;3
        add3 r3,r12,r3
        add3 r2,r12,r2
        st.as r3,[gp,@y@sda]    ;26
        st.as r2,[gp,@z@sda]    ;26
        j_s [blink]

[CZI] Looks great, but it also shows that I forgot to add the
Ra,LIMM,RC variant to the ADD3 instruction, which would let us
generate, instead of:

        mov_s r3,0      ;3
        add3 r2,r3,r2

just:

        add3 r2,0,r2

Indeed it is a longer instruction, but faster.

Unfortunately, register allocation means that we only share two of
the three "mov_s z,0" instructions, but this is sufficient to reduce
register pressure enough to avoid spilling r13 in the
prologue/epilogue.

This patch also contains a (latent?) bug fix.  The implementation of
the default insn "length" attribute assumes instructions of type
"shift" have two input operands and accesses operands[2], hence
specializations of shifts that don't have an operands[2] need to be
categorized as type "unary" (which results in the correct length).

[CZI] The ARC types need an upgrade too.

This patch has been tested with a cross-compiler to arc-elf (hosted
on x86_64-pc-linux-gnu); because I have an incomplete tool chain,
many of the regression tests fail, but there are no new failures with
the new test cases added below.
If you can confirm that there are no issues from additional testing,
is this OK for mainline?

Finally, a quick technical question.  ARC's zero overhead loops
require at least two instructions in the loop, so currently the
backend's implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f lp_count, 20
        lpnz 2f
        lsr r0,r0
        nop
2:      # end single insn loop
        j_s [blink]

[CZI] The ZOLs (LP instructions) are not great when dealing with
short loop blocks.  Hence the NOP instruction.  Personally, I don't
fancy using the LP instruction in this case,
[ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu,
It was great meeting up with you and the Synopsys ARC team at the
GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode
shifts and rotates in the ARC backend.  This first piece splits
SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine and
before reload, in the split1 pass, as suggested by the FIXME comment
above output_shift in arc.cc.  To do this I've copied the
implementation of the x86_pre_reload_split function from the i386
backend, and renamed it arc_pre_reload_split.

Although the actual implementations of the shifts remain the same (as
in output_shift), having them as explicit instructions in the RTL
stream allows better scheduling and use of compact forms when
available.  The benefits can be seen in the two short examples below.

For the function:

unsigned int foo(unsigned int x, unsigned int y) {
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:

foo:    add r1,r1,r1
        add r1,r1,r1
        j_s.d [blink]
        mov_s r0,r1     ;4

and with this patch now generates:

foo:    asl_s r0,r1
        j_s.d [blink]
        asl_s r0,r0

Notice the original (from shift_si3's output_shift) requires the
shift sequence to be monolithic, with the same destination register
as the source (requiring an extra mov_s).  The new version can
eliminate this move, and schedule the second asl in the branch delay
slot of the return.

For the function:

int x,y,z;

void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:

bar:    push_s r13
        ld.as r12,[gp,@x@sda]   ;23
        ld.as r3,[gp,@y@sda]    ;23
        mov r2,0
        add3 r12,r2,r12
        mov r2,0
        add3 r3,r2,r3
        ld.as r2,[gp,@z@sda]    ;23
        st.as r12,[gp,@x@sda]   ;26
        mov r13,0
        add3 r2,r13,r2
        st.as r3,[gp,@y@sda]    ;26
        st.as r2,[gp,@z@sda]    ;26
        j_s.d [blink]
        pop_s r13

where each shift by 3 uses ARC's add3 instruction, which is similar
to x86's lea, implementing x = (y<<3) + z, but requires the value
zero to be placed in a temporary register "z".  Splitting this before
reload allows these pseudos to be shared/reused.
With this patch, we get:

bar:    ld.as r2,[gp,@x@sda]    ;23
        mov_s r3,0      ;3
        add3 r2,r3,r2
        ld.as r3,[gp,@y@sda]    ;23
        st.as r2,[gp,@x@sda]    ;26
        ld.as r2,[gp,@z@sda]    ;23
        mov_s r12,0     ;3
        add3 r3,r12,r3
        add3 r2,r12,r2
        st.as r3,[gp,@y@sda]    ;26
        st.as r2,[gp,@z@sda]    ;26
        j_s [blink]

Unfortunately, register allocation means that we only share two of
the three "mov_s z,0" instructions, but this is sufficient to reduce
register pressure enough to avoid spilling r13 in the
prologue/epilogue.

This patch also contains a (latent?) bug fix.  The implementation of
the default insn "length" attribute assumes instructions of type
"shift" have two input operands and accesses operands[2], hence
specializations of shifts that don't have an operands[2] need to be
categorized as type "unary" (which results in the correct length).

This patch has been tested with a cross-compiler to arc-elf (hosted
on x86_64-pc-linux-gnu); because I have an incomplete tool chain,
many of the regression tests fail, but there are no new failures with
the new test cases added below.  If you can confirm that there are no
issues from additional testing, is this OK for mainline?

Finally, a quick technical question.  ARC's zero overhead loops
require at least two instructions in the loop, so currently the
backend's implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f lp_count, 20
        lpnz 2f
        lsr r0,r0
        nop
2:      # end single insn loop
        j_s [blink]

Could this be more efficiently implemented as:

lshr20: mov lp_count, 10
        lp 2f
        lsr_s r0,r0
        lsr_s r0,r0
2:      # end single insn loop
        j_s [blink]

i.e. half the number of iterations, but doing twice as much useful
work in each iteration?  Or might the nop be free on advanced
microarchitectures, and/or the consecutive dependent shifts cause a
pipeline stall?
It would be nice to fuse loops to implement rotations, such that
rotr16 (aka swap) would look like:

rot16:  mov_s r1,r0
        mov lp_count, 16
        lp 2f
        asl_s r0,r0
        lsr_s r1,r1
2:      # end single insn loop
        j_s.d [blink]
        or_s r0,r0,r1

Thanks in advance,
Roger

2023-09-28  Roger Sayle

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from
        i386, to schedule define_insn_and_split splitters to the
        split1 pass.
        * config/arc/arc.md (ashlsi3): Expand RTL template
        unconditionally.
        (ashrsi3): Likewise.
        (lshrsi3): Likewise.