[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #16 from CVS Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/g:aad65285a1c681feb9fc5b041c86d841b24c3d2a

commit r14-5442-gaad65285a1c681feb9fc5b041c86d841b24c3d2a
Author: Jakub Jelinek
Date:   Tue Nov 14 13:19:48 2023 +0100

    i386: Fix up <insn><dwi>3_doubleword_lowpart [PR112523]

    On Sun, Nov 12, 2023 at 09:03:42PM -0000, Roger Sayle wrote:
    > This patch improves register pressure during reload, inspired by PR 97756.
    > Normally, a double-word right-shift by a constant produces a double-word
    > result, the highpart of which is dead when followed by a truncation.
    > The dead code calculating the high part gets cleaned up post-reload, so
    > the issue isn't normally visible, except for the increased register
    > pressure during reload, sometimes leading to odd register assignments.
    > Providing a post-reload splitter, which clobbers a single wordmode
    > result register instead of a doubleword result register, helps (a bit).

    Unfortunately this broke bootstrap on i686-linux, broke all ACATS tests
    on x86_64-linux, and miscompiled e.g. __floattisf in libgcc there as well.

    The bug is that the shrd{l,q} instruction expects the low part of the
    input to be the same register as the output, rather than the high part
    as the patch implemented.

    split_double_mode (<DWI>mode, &operands[1], 1, &operands[1], &operands[3]);
    sets operands[1] to the lo_half and operands[3] to the hi_half, so if
    operands[0] is not the same register as operands[1] (rather than
    operands[3]) after RA, we should during splitting move operands[1] into
    operands[0].

    Your testcase:
    > #define MASK60 ((1ul << 60) - 1)
    > unsigned long foo (__uint128_t n)
    > {
    >   unsigned long a = n & MASK60;
    >   unsigned long b = (n >> 60);
    >   b = b & MASK60;
    >   unsigned long c = (n >> 120);
    >   return a+b+c;
    > }
    still has the same number of instructions.

    Bootstrapped/regtested on x86_64-linux (where it e.g. turns
                    === acats Summary ===
    -# of unexpected failures       2328
    +# of expected passes           2328
    +# of unexpected failures       0
    and fixes gcc.dg/torture/fp-int-convert-*timode.c FAILs as well) and
    i686-linux (where it previously didn't bootstrap, but compared to
    Friday evening's bootstrap the testresults are ok).

    2023-11-14  Jakub Jelinek

            PR target/112523
            PR ada/112514
            * config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): Move
            operands[1] aka low part of input rather than operands[3] aka
            high part of input to output if not the same register.
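For readers skimming the thread, here is a minimal C sketch (my own illustration, not code from the commit) of the semantics behind the fix: shrdq $c, %hi, %dest computes (dest >> c) | (hi << (64 - c)), so the destination register must already hold the low half of the double-word input, which is why the splitter has to move operands[1] rather than operands[3] into operands[0]. The values below are hypothetical.

    #include <assert.h>
    #include <stdint.h>

    int
    main (void)
    {
      const unsigned c = 60;                  /* constant shift count, 0 < c < 64 */
      uint64_t lo = 0x0123456789abcdefULL;    /* hypothetical low half  */
      uint64_t hi = 0xfULL;                   /* hypothetical high half */
      __uint128_t n = ((__uint128_t) hi << 64) | lo;

      /* Correct splitting: the output starts as the LOW half of the input...  */
      uint64_t dest = lo;
      /* ...so that "shrdq $60, hi, dest" yields the low word of n >> 60.  */
      dest = (dest >> c) | (hi << (64 - c));
      assert (dest == (uint64_t) (n >> c));
      return 0;
    }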
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #15 from Thomas Koenig ---
(In reply to CVS Commits from comment #14)

> Admittedly a single "mov" isn't much of a saving on modern architectures,
> but as demonstrated by the PR, people still track the number of them.

Thanks :-)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #14 from CVS Commits ---
The master branch has been updated by Roger Sayle:

https://gcc.gnu.org/g:0a140730c970870a5125beb1114f6c01679a040e

commit r14-5385-g0a140730c970870a5125beb1114f6c01679a040e
Author: Roger Sayle
Date:   Mon Nov 13 09:05:16 2023 +0000

    i386: Improve reg pressure of double word right shift then truncate.

    This patch improves register pressure during reload, inspired by PR 97756.
    Normally, a double-word right-shift by a constant produces a double-word
    result, the highpart of which is dead when followed by a truncation.
    The dead code calculating the high part gets cleaned up post-reload, so
    the issue isn't normally visible, except for the increased register
    pressure during reload, sometimes leading to odd register assignments.
    Providing a post-reload splitter, which clobbers a single wordmode
    result register instead of a doubleword result register, helps (a bit).

    An example demonstrating this effect is:

    unsigned long foo (__uint128_t n)
    {
      unsigned long a = n & MASK60;
      unsigned long b = (n >> 60);
      b = b & MASK60;
      unsigned long c = (n >> 120);
      return a+b+c;
    }

    which currently with -O2 generates (13 instructions):

    foo:    movabsq $1152921504606846975, %rcx
            xchgq   %rdi, %rsi
            movq    %rsi, %rax
            shrdq   $60, %rdi, %rax
            movq    %rax, %rdx
            movq    %rsi, %rax
            movq    %rdi, %rsi
            andq    %rcx, %rax
            shrq    $56, %rsi
            andq    %rcx, %rdx
            addq    %rsi, %rax
            addq    %rdx, %rax
            ret

    with this patch, we generate one less mov (12 instructions):

    foo:    movabsq $1152921504606846975, %rcx
            xchgq   %rdi, %rsi
            movq    %rdi, %rdx
            movq    %rsi, %rax
            movq    %rdi, %rsi
            shrdq   $60, %rdi, %rdx
            andq    %rcx, %rax
            shrq    $56, %rsi
            addq    %rsi, %rax
            andq    %rcx, %rdx
            addq    %rdx, %rax
            ret

    The significant difference is easier to see via diff:

    <       shrdq   $60, %rdi, %rax
    <       movq    %rax, %rdx
    ---
    >       shrdq   $60, %rdi, %rdx

    Admittedly a single "mov" isn't much of a saving on modern architectures,
    but as demonstrated by the PR, people still track the number of them.

    2023-11-13  Roger Sayle

    gcc/ChangeLog
            * config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): New
            define_insn_and_split to optimize register usage of doubleword
            right shifts followed by truncation.
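As background for why a single word-mode result register suffices, a hedged C sketch (my own helper names, not part of the patch): for a constant shift count 0 < c < 64, the truncated result of a 128-bit right shift depends only on the two input words, which is exactly what one shrdq computes; the high half of the double-word shift result is dead.

    #include <stdint.h>

    /* Hypothetical helper: low 64 bits of (hi:lo) >> c, assuming 0 < c < 64.
       This is what "shrdq $c, %hi, %lo" leaves in the destination register.  */
    static uint64_t
    lowpart_shift_right (uint64_t lo, uint64_t hi, unsigned c)
    {
      return (lo >> c) | (hi << (64 - c));
    }

    /* Usage matching the example above: the b term of foo(), before masking.  */
    unsigned long
    b_term (__uint128_t n)
    {
      return lowpart_shift_right ((uint64_t) n, (uint64_t) (n >> 64), 60);
    }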
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #13 from Thomas Koenig ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
>
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
>
> gets compiled (w/ -O3, -O2 or -Os) to the seemingly suboptimal
>
>         movq    %rdi, %r9
>         movq    %rdx, %rax
>         movq    %rsi, %r8
>         movq    %rcx, %rdx
>         addq    %r9, %rax
>         adcq    %r8, %rdx
>         ret
>
> Clang does:
>
>         movq    %rdi, %rax
>         addq    %rdx, %rax
>         adcq    %rcx, %rsi
>         movq    %rsi, %rdx
>         retq

With current trunk, this is now

        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

so it looks OK.

The original test case has regressed a bit; it is now 39 instructions.
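For context on the addq/adcq pair in the generated code, a minimal sketch (an assumed word-level decomposition for illustration, not GCC source) of how the 128-bit addition splits into machine words: add the low words, then add the high words plus the carry out of the low-word addition.

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128_words;   /* hypothetical type */

    static u128_words
    add_u128 (u128_words x, u128_words y)
    {
      u128_words r;
      r.lo = x.lo + y.lo;              /* addq: low words                   */
      uint64_t carry = r.lo < x.lo;    /* carry flag from the low-word add  */
      r.hi = x.hi + y.hi + carry;      /* adcq: high words plus carry       */
      return r;
    }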
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #12 from Thomas Koenig ---
(In reply to Andrew Pinski from comment #11)
> This seems to be improved on trunk ...

gcc is down to 37 instructions now for the original test case with -O3.
icc, which appears to be best, has 33; see https://godbolt.org/z/461jeozs9 .
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #11 from Andrew Pinski ---
This seems to be improved on trunk ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #10 from Richard Biener ---
GCC 10 branch is being closed.