[Bug target/103393] [12 Regression] Generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128

2021-11-24 Thread jschoen4 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393

--- Comment #4 from John S  ---
I can confirm from my side that this does appear to be the memmove inline
expansion and not the auto vectorizer.  It also occurs with
builtin_memset/builtin_memcpy.
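
A minimal sketch of the builtin variants, assuming the same 32-byte TestData
struct as the test case in the original report.  The function names
cpy_memcpy, cpy_memmove and clear_memset are hypothetical and do not come from
the report.  Compiling this with the reported command line and inspecting the
assembly should show whether ymm registers appear; no output is reproduced
here.

```cpp
struct TestData {
  float arr[8];
};

// The struct is 8 floats, i.e. 32 bytes: exactly one ymm register wide.
static_assert(sizeof(TestData) == 32, "expected a 32-byte struct");

// Fixed-size copy through the memcpy builtin (hypothetical helper).
void cpy_memcpy(TestData& dst, const TestData& src) {
  __builtin_memcpy(&dst, &src, sizeof(TestData));
}

// Same copy through the memmove builtin (hypothetical helper).
void cpy_memmove(TestData& dst, const TestData& src) {
  __builtin_memmove(&dst, &src, sizeof(TestData));
}

// Fixed-size clear through the memset builtin (hypothetical helper).
void clear_memset(TestData& d) {
  __builtin_memset(&d, 0, sizeof(TestData));
}
```

All three operate on a compile-time constant size of 32 bytes, which is the
kind of call the compiler may expand inline instead of calling into libc.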

For some context, this is an issue that would prevent the usage of gcc in my
production environment.  It will certainly impact other use cases outside of my
own as well.  For example, it becomes impossible to use "-mno-vzeroupper -mavx
-mprefer-vector-width=128" and use _mm256_xxx + _mm256_zeroupper()
intrinsics to properly manage the ymm state (clear or not), since the compiler
is now able to insert ymm's almost anywhere via the memmove inlining.
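
To illustrate the usage pattern described above, here is a rough sketch of a
function that uses 256-bit intrinsics explicitly and then clears the upper ymm
state by hand.  The function name sum8_avx is hypothetical.  The target("avx")
attribute is an assumption made only so the snippet compiles without -mavx on
the command line; in the setup described above, -mavx and -mno-vzeroupper
would be passed as flags instead.

```cpp
#include <immintrin.h>

// Hypothetical helper: sums 8 floats with explicit 256-bit intrinsics, then
// clears the upper halves of the ymm registers manually before returning.
__attribute__((target("avx")))
float sum8_avx(const float* p) {
  __m256 v  = _mm256_loadu_ps(p);            // one 256-bit load (ymm)
  __m128 lo = _mm256_castps256_ps128(v);     // lower 4 floats
  __m128 hi = _mm256_extractf128_ps(v, 1);   // upper 4 floats
  __m128 s  = _mm_add_ps(lo, hi);            // 4 partial sums
  s = _mm_hadd_ps(s, s);                     // 2 partial sums, duplicated
  s = _mm_hadd_ps(s, s);                     // total in every lane
  float result = _mm_cvtss_f32(s);
  _mm256_zeroupper();                        // manual ymm state management
  return result;
}
```

With -mno-vzeroupper the compiler does not insert vzeroupper itself, so the
explicit _mm256_zeroupper() call is the only point where the upper state is
cleared.  That bookkeeping only holds if no other ymm use is introduced
behind the programmer's back.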

Up until now the prefer-width option has always behaved such that all
auto-generated vector uses do not exceed the preferred width.  Only explicit use
of the _mm256/_mm512_ .. intrinsics or the "vector types", i.e. `__m256 var;
__m512 var;`, would result in wider register usage.

I do believe Clang/icc behave this way as well, and there are dependencies on
this behavior.  The same also applies with AVX-512 enabled and ZMM usage with
-mprefer-vector-width=128/256, where the downclocking issues can be even more
pronounced.

[Bug tree-optimization/103393] New: [ 12 Regression ] Auto vectorizer generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128

2021-11-23 Thread jschoen4 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393

Bug ID: 103393
   Summary: [ 12 Regression ] Auto vectorizer generating 256bit
register usage with -mprefer-avx128
-mprefer-vector-width=128
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschoen4 at gmail dot com
  Target Milestone: ---

gcc -v
Using built-in specs.
COLLECT_GCC=/gcc_build/bin/gcc
COLLECT_LTO_WRAPPER=/gcc_build/bin/../libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=/gcc_build --include=/gcc_build/include
--disable-multilib --enable-rpath --enable-__cxa_atexit --enable-nls
--disable-checking --disable-libunwind-exceptions --enable-bootstrap
--enable-shared --enable-static --enable-threads=posix --with-gcc --with-gnu-as
--with-gnu-ld --with-system-zlib
--enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-lto
--enable-stage1-languages=c
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20211123 (experimental) (GCC)

Branch: trunk, w/ a latest commit of 721d8b9e26bf8205c1f2125c2626919a408cdbe4

===
=TEST CODE=
===
# cat test.cpp
struct TestData {
  float arr[8];
};
void cpy( TestData& s1, TestData& s2 ) {
  for(int i=0; i<8; ++i) {
    s1.arr[i] = s2.arr[i];
  }
}

===
=cmd  =
===
gcc -S -masm=intel -O2 -mavx -mprefer-avx128 -mprefer-vector-width=128 -Wall
-Wextra test.cpp -o test.s

===
=BAD ASM  =
= GCC 12  =
===
cat test.s
.file   "test.cpp"
.intel_syntax noprefix
.text
.p2align 4
.globl  _Z3cpyR8TestDataS0_
.type   _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
.cfi_startproc
vmovdqu ymm0, YMMWORD PTR [rsi]
vmovdqu YMMWORD PTR [rdi], ymm0
vzeroupper
ret
.cfi_endproc
.LFE0:
.size   _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
.ident  "GCC: (GNU) 12.0.0 20211123 (experimental)"
.section .note.GNU-stack,"",@progbits

===
= GCC 11  = (GCC 10 generates identical asm)
===
cat test.s
.file   "test.cpp"
.intel_syntax noprefix
.text
.p2align 4
.globl  _Z3cpyR8TestDataS0_
.type   _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
.cfi_startproc
mov edx, 32
jmp memmove
.cfi_endproc
.LFE0:
.size   _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
.ident  "GCC: (GNU) 11.2.0"
.section .note.GNU-stack,"",@progbits

=
= GCC 9 =
=
cat test.s
.file   "test.cpp"
.intel_syntax noprefix
.text
.p2align 4
.globl  _Z3cpyR8TestDataS0_
.type   _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
.cfi_startproc
xor eax, eax
.p2align 4,,10
.p2align 3
.L2:
vmovss  xmm0, DWORD PTR [rsi+rax]
vmovss  DWORD PTR [rdi+rax], xmm0
add rax, 4
cmp rax, 32
jne .L2
ret
.cfi_endproc
.LFE0:
.size   _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
.ident  "GCC: (GNU) 9.3.0"
.section .note.GNU-stack,"",@progbits




The auto vectorizer is generating YMM / 256-bit vector instructions with the
-mprefer-avx128 and -mprefer-vector-width=128 flags specified.  This is an
issue for low-latency software.  Using registers 256 bits wide or wider causes
CPU jitter problems on Skylake / Cascade Lake / Ice Lake chips.  This is true
even in cases where the instructions used are considered avx256-light
instructions, because a "mix of instructions" is used to determine the power
levels (this is also mentioned in Intel's optimization manual).

The auto vectorizer needs to respect the prefer-width flags.  Enabling/using
newer instruction sets, i.e. AVX/AVX2/AVX512, does not require usage of the
wider register types.
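
For comparison, a hand-written copy that stays within 128-bit registers could
look like the sketch below.  The function name cpy128 is hypothetical, and
this is an illustration of the expected register width rather than a verified
workaround for the regression.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_storeu_ps

struct TestData {
  float arr[8];
};

// Hypothetical helper: copies the 32-byte struct as two 128-bit halves, so
// only xmm registers are requested explicitly.
void cpy128(TestData& dst, const TestData& src) {
  __m128 lo = _mm_loadu_ps(src.arr);      // floats 0..3
  __m128 hi = _mm_loadu_ps(src.arr + 4);  // floats 4..7
  _mm_storeu_ps(dst.arr, lo);
  _mm_storeu_ps(dst.arr + 4, hi);
}
```

With -mavx these intrinsics are normally emitted as VEX-encoded moves on xmm
registers, which shows that enabling AVX by itself does not force 256-bit
register usage.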

[Bug ipa/102554] [10/11 Regression] Inlining missed at -O3 with non-default --param=early-inlining-insns and pragma optimize

2021-10-04 Thread jschoen4 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102554

John S  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|12.0                        |

--- Comment #3 from John S  ---
(In reply to Martin Liška from comment #2)
> (In reply to Richard Biener from comment #1)
> > I suspect that the optimize() attribute resets the param value to its
> > default.
> 
> Yes, it's fixed on master with g:r12-4038-g6de9f0c13b27c343.
> 
> > 
> > Martin - can you investigate / bisect?
> 
> Sure, it started with r10-4944-g1e83bd7003e03160.
> 
> I tend closing that as fixed, what do you think Richi?

I can confirm I am seeing g:r12-4038-g6de9f0c13b27c343 resolve the issue.

Is it possible to get this applied to the upcoming 10.4 and 11.3 releases?  It's
making upgrading to 10.x / 11.x versions challenging in certain
latency-sensitive production environments.

[Bug ipa/102554] New: [ 10/11/12 Regression ] Inlining missed at -O3 with non-default --param=early-inlining-insns and pragma optimize

2021-10-01 Thread jschoen4 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102554

Bug ID: 102554
   Summary: [ 10/11/12 Regression ] Inlining missed at -O3 with
non-default --param=early-inlining-insns and pragma
optimize
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschoen4 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

GNU C++14 (GCC) version 10.2.0 (x86_64-pc-linux-gnu)
compiled by GNU C version 10.2.0, GMP version 6.0.0, MPFR version
3.1.1, MPC version 1.0.1, isl version isl-0.16.1-GMP

Target: x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.0 (GCC)

===
=TEST CODE=
===
cat test.cpp
#pragma GCC push_options
#pragma GCC optimize ("no-lifetime-dse")
class TestClass
{
public:
  static inline int should_inline() {
    return 10;
  }
};
#pragma GCC pop_options

int main() {
  return TestClass::should_inline() + 1;
}

===
=cmd  =
===
gcc-10 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall
-Wextra

===
=BAD ASM  =
===
cat test.s
.file   "test.cpp"
.text
.section .text._ZN9TestClass13should_inlineEv,"axG",@progbits,_ZN9TestClass13should_inlineEv,comdat
.p2align 4
.weak   _ZN9TestClass13should_inlineEv
.type   _ZN9TestClass13should_inlineEv, @function
_ZN9TestClass13should_inlineEv:
.LFB0:
.cfi_startproc
movl $10, %eax
ret
.cfi_endproc
.LFE0:
.size   _ZN9TestClass13should_inlineEv, .-_ZN9TestClass13should_inlineEv
.section .text.startup,"ax",@progbits
.p2align 4
.globl  main
.type   main, @function
main:
.LFB1:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
call _ZN9TestClass13should_inlineEv
addq $8, %rsp
.cfi_def_cfa_offset 8
addl $1, %eax
ret
.cfi_endproc
.LFE1:
.size   main, .-main
.ident  "GCC: (GNU) 10.2.0"
.section .note.GNU-stack,"",@progbits

===
=info =
===
cat test.cpp.079i.inline
...
Deciding on inlining of small functions.  Starting with size 9.
Enqueueing calls in int main()/1.
test.cpp:13:34: missed:   not inlinable: int main()/1 -> static int
TestClass::should_inline()/0, optimization level attribute mismatch

  param_early_inlining_insns (0x1e/0xe)
Enqueueing calls in static int TestClass::should_inline()/0.
node context cache: 0 hits, 0 misses, 1 initializations
...

===
=GOOD ASM =
===
gcc-10 test.cpp -S --param=early-inlining-insns=14 -O3 -fno-lifetime-dse -Wall
-Wextra
.file   "test.cpp"
.text
.section .text.startup,"ax",@progbits
.p2align 4
.globl  main
.type   main, @function
main:
.LFB1:
.cfi_startproc
movl $11, %eax
ret
.cfi_endproc
.LFE1:
.size   main, .-main
.ident  "GCC: (GNU) 10.2.0"
.section .note.GNU-stack,"",@progbits


==
=notes=
==

Starting with GCC 10 (GCC 9 works correctly), the use of
--param=early-inlining-insns=30 and -O3 on the command line, combined with a
"#pragma GCC optimize" in the source code, even one that does not change the
effective optimization attributes, causes "optimization level attribute
mismatch" to occur in the inliner.

In the example I placed -fno-lifetime-dse both on the command line and in
#pragma GCC optimize ("no-lifetime-dse"), so the pragma has no impact at all on
the effective optimization attributes.

The issue is not specific to #pragma GCC optimize ("no-lifetime-dse");
any #pragma GCC optimize line will have this effect, even "unrecognized" ones,
e.g.
#pragma GCC optimize ("fake_attribute")
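
As a complete file, the unrecognized-option variant might look like the sketch
below.  It mirrors the test case above; caller() is a hypothetical stand-in
for main().  GCC is expected to warn about the unknown option name, and
whether a given GCC version reproduces the missed inlining with this exact
file is not verified here.

```cpp
#pragma GCC push_options
// "fake_attribute" is deliberately not a real optimization option.
#pragma GCC optimize ("fake_attribute")
class TestClass
{
public:
  static inline int should_inline() {
    return 10;
  }
};
#pragma GCC pop_options

// Hypothetical stand-in for main() from the test case above.
int caller() {
  return TestClass::should_inline() + 1;
}
```

The value returned by caller() is the same whether or not the call is inlined;
only the generated assembly and the inliner dump differ.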

Any value OTHER THAN --param=early-inlining-insns=14 on the command line, when
used with -O3 and a #pragma GCC optimize, will trigger this, e.g.:
==
=optimize correctly  =
==
gcc-10 test.cpp -S --param=early-inlining-insns=14 -O3 -fno-lifetime-dse -Wall
-Wextra

gcc-10 test.cpp -S --param=early-inlining-insns=30 -O2 -fno-lifetime-dse -Wall
-Wextra

gcc-9 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall
-Wextra

==
=missed optimize =
==
gcc-10 test.cpp -S --param=early-inlining-insns=12 -O3 -fno-lifetime-dse -Wall
-Wextra

gcc-10 test.cpp -S --param=early-inlining-insns=17 -O3 -fno-lifetime-dse -Wall
-Wextra
etc.

gcc-11 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall
-Wextra
gcc-12 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall
-Wextra
gcc-trunk test.cpp -S --param=early-inlining-insns=30 -O3