On Fri, Oct 6, 2023 at 3:59 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Grr!  I've done it again.  ENOPATCH.
>
> > -----Original Message-----
> > From: Roger Sayle <ro...@nextmovesoftware.com>
> > Sent: 06 October 2023 14:58
> > To: 'gcc-patches@gcc.gnu.org' <gcc-patches@gcc.gnu.org>
> > Cc: 'Uros Bizjak' <ubiz...@gmail.com>
> > Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using
> > s[ha]r+rcr.
> >
> >
> > This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr
> > functions to implement doubleword right shifts by 1 bit, using a shift of
> > the highpart that sets the carry flag followed by a rotate-carry-right
> > (RCR) instruction on the lowpart.
> >
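> > For illustration (this is just a model of the transformation, not code
> > from the patch), the unsigned case decomposes in portable C roughly as:
> >
> > unsigned __int128 shr1 (unsigned __int128 x)
> > {
> >   unsigned long long lo = (unsigned long long) x;
> >   unsigned long long hi = (unsigned long long) (x >> 64);
> >   unsigned long long carry = hi & 1;       /* CF := low bit of highpart */
> >   hi >>= 1;                                /* shrq %rdx */
> >   lo = (lo >> 1) | (carry << 63);          /* rcrq %rax */
> >   return ((unsigned __int128) hi << 64) | lo;
> > }
> >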
> > Conceptually this is similar to the recent left shift patch, but with two
> > complicating factors.  The first is that although the RCR sequence is
> > shorter, and is a ~3x performance improvement on AMD, my micro-benchmarking
> > shows it to be ~10% slower on Intel.  Hence this patch also introduces a new
> > X86_TUNE_USE_RCR tuning parameter.  The second is that I believe this is
> > the first time a "rotate-right-through-carry" and a right shift that sets
> > the carry flag from the least significant bit have been modelled in GCC RTL
> > (on a MODE_CC target).  For this I've used the i386 back-end's UNSPEC_CC_NE,
> > which seems appropriate.  Finally, rcrsi2 and rcrdi2 are separate
> > define_insns so that we can use their generator functions.
> >
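> > In outline, the new path in ix86_split_lshr for the 64-bit case does
> > something like the following (a simplified sketch, not the patch itself;
> > the generator calls and conditions are abbreviated):
> >
> >   if (count == 1 && (TARGET_USE_RCR || optimize_size > 1 /* -Oz */))
> >     {
> >       /* Shift the highpart right by one, setting CF from its low bit.  */
> >       emit_insn (gen_lshrdi3_carry (high[0], high[0]));
> >       /* Rotate CF into the top bit of the lowpart.  */
> >       emit_insn (gen_rcrdi2 (low[0], low[0]));
> >     }
> >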
> > For the pair of functions:
> > unsigned __int128 foo(unsigned __int128 x) { return x >> 1; }
> > __int128 bar(__int128 x) { return x >> 1; }
> >
> > with -O2 -march=znver4 we previously generated:
> >
> > foo:    movq    %rdi, %rax
> >         movq    %rsi, %rdx
> >         shrdq   $1, %rsi, %rax
> >         shrq    %rdx
> >         ret
> > bar:    movq    %rdi, %rax
> >         movq    %rsi, %rdx
> >         shrdq   $1, %rsi, %rax
> >         sarq    %rdx
> >         ret
> >
> > with this patch we now generate:
> >
> > foo:    movq    %rsi, %rdx
> >         movq    %rdi, %rax
> >         shrq    %rdx
> >         rcrq    %rax
> >         ret
> > bar:    movq    %rsi, %rdx
> >         movq    %rdi, %rax
> >         sarq    %rdx
> >         rcrq    %rax
> >         ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> > make -k check, both with and without --target_board=unix{-m32}, with no new
> > failures.  And to provide additional testing, I've also bootstrapped and
> > regression tested a version of this patch where the RCR is always generated
> > (independent of the -march target), again with no regressions.  Ok for
> > mainline?
> >
> >
> > 2023-10-06  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.cc (ix86_split_ashr): Split shifts by
> >         one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR
> >         or -Oz.
> >         (ix86_split_lshr): Likewise, split shifts by one bit into
> >         lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz.
> >         * config/i386/i386.h (TARGET_USE_RCR): New backend macro.
> >         * config/i386/i386.md (rcrsi2): New define_insn for rcrl.
> >         (rcrdi2): New define_insn for rcrq.
> >         (<anyshiftrt><mode>3_carry): New define_insn for right shifts that
> >         set the carry flag from the least significant bit, modelled using
> >         UNSPEC_CC_NE.
> >         * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter
> >         controlling use of rcr 1 vs. shrd, which is significantly faster on
> >         AMD processors.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/rcr-1.c: New 64-bit test case.
> >         * gcc.target/i386/rcr-2.c: New 32-bit test case.

OK.

Just don't set the new tune for generic. I hope Intel people notice
the performance difference...

Thanks,
Uros.
