[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-14 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #16 from CVS Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/g:aad65285a1c681feb9fc5b041c86d841b24c3d2a

commit r14-5442-gaad65285a1c681feb9fc5b041c86d841b24c3d2a
Author: Jakub Jelinek 
Date:   Tue Nov 14 13:19:48 2023 +0100

i386: Fix up <insn><dwi>3_doubleword_lowpart [PR112523]

On Sun, Nov 12, 2023 at 09:03:42PM -, Roger Sayle wrote:
> This patch improves register pressure during reload, inspired by PR 97756.
> Normally, a double-word right-shift by a constant produces a double-word
> result, the highpart of which is dead when followed by a truncation.
> The dead code calculating the high part gets cleaned up post-reload, so
> the issue isn't normally visible, except for the increased register
> pressure during reload, sometimes leading to odd register assignments.
> Providing a post-reload splitter, which clobbers a single wordmode
> result register instead of a doubleword result register, helps (a bit).

Unfortunately this broke bootstrap on i686-linux and all ACATS tests on
x86_64-linux, and miscompiled e.g. __floattisf in libgcc there as well.

The bug is that the shrd{l,q} instruction expects the low part of the input
to be the same register as the output, rather than the high part as the
patch implemented.
  split_double_mode (<DWI>mode, &operands[1], 1, &operands[1],
                     &operands[3]);
sets operands[1] to the lo_half and operands[3] to the hi_half, so if
operands[0] is not the same register as operands[1] (rather than operands[3])
after RA, we should move operands[1] into operands[0] during splitting.
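
In C terms, the constraint can be modeled like this (an illustrative sketch of
the shrd semantics, not code from the commit; the identifiers are ours):

#include <assert.h>
#include <stdint.h>

/* Model of x86-64 shrdq $n, %hi, %lo: the destination register both
   supplies the LOW half of the double-word input and receives the low
   word of the shifted result.  This is why the splitter has to move
   operands[1] (the low half), not operands[3], into operands[0].  */
static uint64_t
shrd64 (uint64_t lo, uint64_t hi, unsigned n)  /* valid for 0 < n < 64 */
{
  return (lo >> n) | (hi << (64 - n));
}

int
main (void)
{
  __uint128_t x = ((__uint128_t) 0x0123456789abcdefULL << 64)
                  | 0xfedcba9876543210ULL;
  /* The low word of x >> 60 must match the shrd model.  */
  assert ((uint64_t) (x >> 60)
          == shrd64 ((uint64_t) x, (uint64_t) (x >> 64), 60));
  return 0;
}

Compiled with gcc -O2 on x86_64, shrd64 typically becomes a single shrdq whose
output register is the one that held the lo argument.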

Your testcase:
> #define MASK60 ((1ul << 60) - 1)
> unsigned long foo (__uint128_t n)
> {
>   unsigned long a = n & MASK60;
>   unsigned long b = (n >> 60);
>   b = b & MASK60;
>   unsigned long c = (n >> 120);
>   return a+b+c;
> }

still has the same number of instructions.

Bootstrapped/regtested on x86_64-linux (where it e.g. turns
=== acats Summary ===
-# of unexpected failures  2328
+# of expected passes      2328
+# of unexpected failures  0
and fixes gcc.dg/torture/fp-int-convert-*timode.c FAILs as well)
and i686-linux (where it previously didn't bootstrap, but compared to
Friday evening's bootstrap the testresults are ok).

2023-11-14  Jakub Jelinek  

PR target/112523
PR ada/112514
* config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): Move
operands[1] aka low part of input rather than operands[3] aka high
part of input to output if not the same register.

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #15 from Thomas Koenig ---
(In reply to CVS Commits from comment #14)

> Admittedly a single "mov" isn't much of a saving on modern architectures,
> but as demonstrated by the PR, people still track the number of them.

Thanks :-)

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #14 from CVS Commits ---
The master branch has been updated by Roger Sayle:

https://gcc.gnu.org/g:0a140730c970870a5125beb1114f6c01679a040e

commit r14-5385-g0a140730c970870a5125beb1114f6c01679a040e
Author: Roger Sayle 
Date:   Mon Nov 13 09:05:16 2023 +

i386: Improve reg pressure of double word right shift then truncate.

This patch improves register pressure during reload, inspired by PR 97756.
Normally, a double-word right-shift by a constant produces a double-word
result, the highpart of which is dead when followed by a truncation.
The dead code calculating the high part gets cleaned up post-reload, so
the issue isn't normally visible, except for the increased register
pressure during reload, sometimes leading to odd register assignments.
Providing a post-reload splitter, which clobbers a single wordmode
result register instead of a doubleword result register, helps (a bit).

An example demonstrating this effect is:

#define MASK60 ((1ul << 60) - 1)
unsigned long foo (__uint128_t n)
{
  unsigned long a = n & MASK60;
  unsigned long b = (n >> 60);
  b = b & MASK60;
  unsigned long c = (n >> 120);
  return a+b+c;
}

which currently with -O2 generates (13 instructions):
foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rsi, %rax
        shrdq   $60, %rdi, %rax
        movq    %rax, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        andq    %rcx, %rax
        shrq    $56, %rsi
        andq    %rcx, %rdx
        addq    %rsi, %rax
        addq    %rdx, %rax
        ret

with this patch, we generate one less mov (12 instructions):
foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rdi, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        shrdq   $60, %rdi, %rdx
        andq    %rcx, %rax
        shrq    $56, %rsi
        addq    %rsi, %rax
        andq    %rcx, %rdx
        addq    %rdx, %rax
        ret

The significant difference is easier to see via diff:
<   shrdq   $60, %rdi, %rax
<   movq%rax, %rdx
---
>   shrdq   $60, %rdi, %rdx

Admittedly a single "mov" isn't much of a saving on modern architectures,
but as demonstrated by the PR, people still track the number of them.

2023-11-13  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.md (<insn><dwi>3_doubleword_lowpart): New
define_insn_and_split to optimize register usage of doubleword
right shifts followed by truncation.
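
In isolation, the shape of code this pattern targets looks like the following
(a minimal sketch under our own naming, not taken from the patch):

#include <stdint.h>

/* Only the low 64 bits of the double-word shift survive the truncation
   to uint64_t, so the computation of the high half of the shift result
   is dead; before this patch, reload nevertheless had to allocate a
   register pair for the full double-word intermediate.  */
uint64_t
lowpart_after_shift (__uint128_t n)
{
  return (uint64_t) (n >> 60);
}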

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #13 from Thomas Koenig ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
> 
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
> 
> gets compiled (with -O3, -O2 or -Os) to the seemingly suboptimal
> 
> movq%rdi, %r9
> movq%rdx, %rax
> movq%rsi, %r8
> movq%rcx, %rdx
> addq%r9, %rax
> adcq%r8, %rdx
> ret
> 
> Clang does:
> 
> movq%rdi, %rax
> addq%rdx, %rax
> adcq%rcx, %rsi
> movq%rsi, %rdx
> retq

With current trunk, this is now

movq%rdx, %rax
movq%rcx, %rdx
addq%rdi, %rax
adcq%rsi, %rdx
ret

so it looks OK.

The original test case regressed a bit; it is now 39 instructions.

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-07-16 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #12 from Thomas Koenig ---
(In reply to Andrew Pinski from comment #11)
> This seems to be improved on trunk ...

gcc is down to 37 instructions now for the original test case with -O3.
icc, which appears to generate the best code, has 33; see
https://godbolt.org/z/461jeozs9 .

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-07-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Andrew Pinski changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #11 from Andrew Pinski ---
This seems to be improved on trunk ...

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-07-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Richard Biener changed:

   What|Removed |Added

   Target Milestone|10.5|11.5

--- Comment #10 from Richard Biener ---
GCC 10 branch is being closed.