[Bug target/103393] [12 Regression] Generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393

--- Comment #4 from John S ---
I can confirm from my side that the cause does appear to be the memmove inline expansion and not the auto vectorizer. It also occurs with builtin_memset/builtin_memcpy.

For some context, this is an issue that would prevent the use of gcc in my production environment. It will certainly impact other use cases outside of my own as well. For example, it becomes impossible to use "-mno-vzeroupper -mavx -mprefer-vector-width=128" together with the _mm256_xxx + _mm256_zeroupper() intrinsics to properly manage the ymm state (clear or not), since the compiler is now able to insert ymm usage almost anywhere via the memmove inlining.

Up until now, the prefer-width option has always behaved such that no auto-generated vector use exceeds the preferred width. Only explicit use of the _mm256/_mm512_ .. intrinsics or the "vector types", i.e. `__m256 var; __m512 var;`, would result in wider register usage. I do believe Clang/icc behave this way as well, and there are dependencies on this behavior.

The same also applies with AVX-512 enabled, with ZMM usage and prefer=128/256, where the downclocking issues can be even more pronounced.
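To make the memset/memcpy observation above concrete, here is a minimal sketch of how the original test case might be adapted to call the builtins directly. The function names cpy_builtin, clear_builtin and builtins_roundtrip are hypothetical and not part of the report; the struct is the same 32-byte TestData. The sketch only checks that the copies behave correctly and makes no claim about which registers a given compiler emits.

```cpp
#include <cassert>

// Same 32-byte layout as TestData in the original report.
struct TestData {
    float arr[8];
};
static_assert(sizeof(TestData) == 32, "expected a 32-byte (256-bit) object");

// Hypothetical variant of cpy(): copy through the builtin instead of a loop.
void cpy_builtin(TestData& dst, const TestData& src) {
    __builtin_memcpy(&dst, &src, sizeof(TestData));
}

// Hypothetical helper: clear the object through the builtin.
void clear_builtin(TestData& dst) {
    __builtin_memset(&dst, 0, sizeof(TestData));
}

// Checks semantics only; it says nothing about the registers used.
bool builtins_roundtrip() {
    TestData src{};
    for (int i = 0; i < 8; ++i) {
        src.arr[i] = static_cast<float>(i + 1);
    }
    TestData dst{};
    cpy_builtin(dst, src);
    for (int i = 0; i < 8; ++i) {
        assert(dst.arr[i] == src.arr[i]);
    }
    clear_builtin(dst);
    for (int i = 0; i < 8; ++i) {
        if (dst.arr[i] != 0.0f) {
            return false;
        }
    }
    return true;
}
```

Compiling this with the command line from the original report (gcc -S -masm=intel -O2 -mavx -mprefer-avx128 -mprefer-vector-width=128) and looking for ymm operands in the output would be one way to check whether a given compiler build is affected.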
[Bug tree-optimization/103393] New: [ 12 Regression ] Auto vectorizer generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393

            Bug ID: 103393
           Summary: [ 12 Regression ] Auto vectorizer generating 256bit register usage with -mprefer-avx128 -mprefer-vector-width=128
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschoen4 at gmail dot com
  Target Milestone: ---

gcc -v
Using built-in specs.
COLLECT_GCC=/gcc_build/bin/gcc
COLLECT_LTO_WRAPPER=/gcc_build/bin/../libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=/gcc_build --include=/gcc_build/include --disable-multilib --enable-rpath --enable-__cxa_atexit --enable-nls --disable-checking --disable-libunwind-exceptions --enable-bootstrap --enable-shared --enable-static --enable-threads=posix --with-gcc --with-gnu-as --with-gnu-ld --with-system-zlib --enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-lto --enable-stage1-languages=c
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20211123 (experimental) (GCC)

Branch: trunk, w/ a latest commit of 721d8b9e26bf8205c1f2125c2626919a408cdbe4

=== TEST CODE ===
# cat test.cpp
struct TestData {
    float arr[8];
};
void cpy( TestData& s1, TestData& s2 ) {
    for(int i=0; i<8; ++i) {
        s1.arr[i] = s2.arr[i];
    }
}

=== cmd ===
gcc -S -masm=intel -O2 -mavx -mprefer-avx128 -mprefer-vector-width=128 -Wall -Wextra test.cpp -o test.s

=== BAD ASM (GCC 12) ===
cat test.s
        .file "test.cpp"
        .intel_syntax noprefix
        .text
        .p2align 4
        .globl _Z3cpyR8TestDataS0_
        .type _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
        .cfi_startproc
        vmovdqu ymm0, YMMWORD PTR [rsi]
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
        .cfi_endproc
.LFE0:
        .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
        .ident "GCC: (GNU) 12.0.0 20211123 (experimental)"
        .section .note.GNU-stack,"",@progbits

=== GCC 11 (GCC 10 generates identical asm) ===
cat test.s
        .file "test.cpp"
        .intel_syntax noprefix
        .text
        .p2align 4
        .globl _Z3cpyR8TestDataS0_
        .type _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
        .cfi_startproc
        mov edx, 32
        jmp memmove
        .cfi_endproc
.LFE0:
        .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
        .ident "GCC: (GNU) 11.2.0"
        .section .note.GNU-stack,"",@progbits

=== GCC 9 ===
cat test.s
        .file "test.cpp"
        .intel_syntax noprefix
        .text
        .p2align 4
        .globl _Z3cpyR8TestDataS0_
        .type _Z3cpyR8TestDataS0_, @function
_Z3cpyR8TestDataS0_:
.LFB0:
        .cfi_startproc
        xor eax, eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovss xmm0, DWORD PTR [rsi+rax]
        vmovss DWORD PTR [rdi+rax], xmm0
        add rax, 4
        cmp rax, 32
        jne .L2
        ret
        .cfi_endproc
.LFE0:
        .size _Z3cpyR8TestDataS0_, .-_Z3cpyR8TestDataS0_
        .ident "GCC: (GNU) 9.3.0"
        .section .note.GNU-stack,"",@progbits

The auto vectorizer is generating YMM / 256-bit vector instructions with the -mprefer-avx128 and -mprefer-vector-width=128 flags specified.

This is an issue for low-latency software. Using registers 256 bits wide and wider causes CPU jitter problems on Skylake / Cascade Lake / Ice Lake chips. This is true even in cases where the instructions used are considered avx256-light instructions, because a "mix of instructions" is used to determine the power levels (this is also mentioned in Intel's optimization manual).

The auto vectorizer needs to respect the prefer-width flags. Enabling or using newer instruction sets, i.e. AVX/AVX2/AVX512, does not require use of the wider register types.
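One way to catch this kind of regression automatically is to scan the generated assembly for wide register names. The following is only a sketch under stated assumptions: uses_wide_registers, file_uses_wide_registers and self_check are hypothetical helper names, and the pattern simply looks for lowercase ymmN/zmmN tokens, so it would also flag code that uses the 256-bit or 512-bit intrinsics on purpose.

```cpp
#include <cassert>
#include <fstream>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: true if the assembly text names any ymmN or zmmN
// register. The match is case-sensitive, so "YMMWORD PTR" alone is ignored.
bool uses_wide_registers(const std::string& asm_text) {
    static const std::regex wide_reg(R"(\b[yz]mm[0-9]+\b)");
    return std::regex_search(asm_text, wide_reg);
}

// Hypothetical helper: read a .s file and scan it. A missing or unreadable
// file yields an empty string and therefore returns false.
bool file_uses_wide_registers(const std::string& path) {
    std::ifstream in(path);
    std::stringstream buf;
    buf << in.rdbuf();
    return uses_wide_registers(buf.str());
}

// Checks the helper against instruction lines quoted in the report.
void self_check() {
    assert(uses_wide_registers("vmovdqu ymm0, YMMWORD PTR [rsi]"));   // GCC 12 output
    assert(!uses_wide_registers("vmovss xmm0, DWORD PTR [rsi+rax]")); // GCC 9 output
}
```

Run against the test.s produced by the command above, a check like this would be expected to fire for the GCC 12 listing and stay quiet for the GCC 9 and GCC 11 listings.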
[Bug ipa/102554] [10/11 Regression] Inlining missed at -O3 with non-default --param=early-inlining-insns and pragma optimize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102554

John S changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|12.0                        |

--- Comment #3 from John S ---
(In reply to Martin Liška from comment #2)
> (In reply to Richard Biener from comment #1)
> > I suspect that the optimize() attribute resets the param value to its
> > default.
>
> Yes, it's fixed on master with g:r12-4038-g6de9f0c13b27c343.
>
> >
> > Martin - can you investigate / bisect?
>
> Sure, it started with r10-4944-g1e83bd7003e03160.
>
> I tend closing that as fixed, what do you think Richi?

I can confirm I am seeing g:r12-4038-g6de9f0c13b27c343 resolve the issue.

Is it possible to get this applied into the upcoming 10.4, 11.3 releases? It's making upgrading to 10.x / 11.x versions challenging in certain latency sensitive production environments.
[Bug ipa/102554] New: [ 10/11/12 Regression ] Inlining missed at -O3 with non-default --param=early-inlining-insns and pragma optimize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102554

            Bug ID: 102554
           Summary: [ 10/11/12 Regression ] Inlining missed at -O3 with non-default --param=early-inlining-insns and pragma optimize
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschoen4 at gmail dot com
                CC: marxin at gcc dot gnu.org
  Target Milestone: ---

GNU C++14 (GCC) version 10.2.0 (x86_64-pc-linux-gnu)
        compiled by GNU C version 10.2.0, GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version isl-0.16.1-GMP
Target: x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.0 (GCC)

=== TEST CODE ===
cat test.cpp
#pragma GCC push_options
#pragma GCC optimize ("no-lifetime-dse")
class TestClass {
public:
    static inline int should_inline() {
        return 10;
    }
};
#pragma GCC pop_options
int main() {
    return TestClass::should_inline() + 1;
}

=== cmd ===
gcc-10 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall -Wextra

=== BAD ASM ===
cat test.s
        .file "test.cpp"
        .text
        .section .text._ZN9TestClass13should_inlineEv,"axG",@progbits,_ZN9TestClass13should_inlineEv,comdat
        .p2align 4
        .weak _ZN9TestClass13should_inlineEv
        .type _ZN9TestClass13should_inlineEv, @function
_ZN9TestClass13should_inlineEv:
.LFB0:
        .cfi_startproc
        movl $10, %eax
        ret
        .cfi_endproc
.LFE0:
        .size _ZN9TestClass13should_inlineEv, .-_ZN9TestClass13should_inlineEv
        .section .text.startup,"ax",@progbits
        .p2align 4
        .globl main
        .type main, @function
main:
.LFB1:
        .cfi_startproc
        subq $8, %rsp
        .cfi_def_cfa_offset 16
        call _ZN9TestClass13should_inlineEv
        addq $8, %rsp
        .cfi_def_cfa_offset 8
        addl $1, %eax
        ret
        .cfi_endproc
.LFE1:
        .size main, .-main
        .ident "GCC: (GNU) 10.2.0"
        .section .note.GNU-stack,"",@progbits

=== info ===
cat test.cpp.079i.inline
...
Deciding on inlining of small functions. Starting with size 9.
Enqueueing calls in int main()/1.
test.cpp:13:34: missed: not inlinable: int main()/1 -> static int TestClass::should_inline()/0, optimization level attribute mismatch
param_early_inlining_insns (0x1e/0xe)
Enqueueing calls in static int TestClass::should_inline()/0.
node context cache: 0 hits, 0 misses, 1 initializations
...

=== GOOD ASM ===
gcc-10 test.cpp -S --param=early-inlining-insns=14 -O3 -fno-lifetime-dse -Wall -Wextra
        .file "test.cpp"
        .text
        .section .text.startup,"ax",@progbits
        .p2align 4
        .globl main
        .type main, @function
main:
.LFB1:
        .cfi_startproc
        movl $11, %eax
        ret
        .cfi_endproc
.LFE1:
        .size main, .-main
        .ident "GCC: (GNU) 10.2.0"
        .section .note.GNU-stack,"",@progbits

=== notes ===
Starting with gcc 10 (gcc 9 works correctly), using --param=early-inlining-insns=30 and -O3 on the command line together with a "#pragma GCC optimize" in the source code, even one that does not change the effective optimization attributes, causes an "optimization level attribute mismatch" in the inliner.

In the example I placed -fno-lifetime-dse both on the command line and in the pragma GCC optimize ("no-lifetime-dse"), so the pragma has no impact at all on the effective optimization attributes.

The issue is not specific to pragma GCC optimize "no-lifetime-dse"; any pragma GCC optimize line will have this effect, even "unrecognized" ones, i.e. #pragma GCC optimize ("fake_attribute")

Any value OTHER THAN --param=early-inlining-insns=14 on the command line, when used with -O3 and pragma optimize, will trigger this, i.e.

=== optimize correctly ===
gcc-10 test.cpp -S --param=early-inlining-insns=14 -O3 -fno-lifetime-dse -Wall -Wextra
gcc-10 test.cpp -S --param=early-inlining-insns=30 -O2 -fno-lifetime-dse -Wall -Wextra
gcc-9 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall -Wextra

=== missed optimize ===
gcc-10 test.cpp -S --param=early-inlining-insns=12 -O3 -fno-lifetime-dse -Wall -Wextra
gcc-10 test.cpp -S --param=early-inlining-insns=17 -O3 -fno-lifetime-dse -Wall -Wextra
etc.
gcc-11 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall -Wextra
gcc-12 test.cpp -S --param=early-inlining-insns=30 -O3 -fno-lifetime-dse -Wall -Wextra
gcc-trunk test.cpp -S --param=early-inlining-insns=30 -O3
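
For readers decoding the inline dump, the pair in "param_early_inlining_insns (0x1e/0xe)" is 30 and 14 in decimal: the value passed on the command line and the only value reported to inline at -O3. The toy model below is a sketch of that comparison, not GCC's actual code. ToyOptions and the three functions are hypothetical names, and the assumption that the function under the pragma ends up with 14 follows the suspicion voiced elsewhere in this thread that the optimize attribute resets the param to its default. It covers the -O3 cases only.

```cpp
#include <cassert>

// Values from the dump line "param_early_inlining_insns (0x1e/0xe)".
constexpr int kCommandLineValue = 0x1e;  // 30, from --param=early-inlining-insns=30
constexpr int kResetValue = 0xe;         // 14, the only value reported to inline at -O3
static_assert(kCommandLineValue == 30, "0x1e is 30 in decimal");
static_assert(kResetValue == 14, "0xe is 14 in decimal");

// Hypothetical stand-in for per-function option state; not a GCC structure.
struct ToyOptions {
    int early_inlining_insns;
};

// The caller (main) simply sees the command-line value.
ToyOptions options_from_command_line(int value) {
    return ToyOptions{value};
}

// Assumption: a function under "#pragma GCC optimize" ends up with the
// reset value (14) regardless of what was given on the command line.
ToyOptions options_after_pragma(int /*command_line_value*/) {
    return ToyOptions{kResetValue};
}

// Toy version of the mismatch check: inline only if both sides agree.
bool toy_inlines_at_O3(int command_line_value) {
    const ToyOptions caller = options_from_command_line(command_line_value);
    const ToyOptions callee = options_after_pragma(command_line_value);
    return caller.early_inlining_insns == callee.early_inlining_insns;
}
```

Under these assumptions the model agrees with the -O3 command lines listed above: 14 inlines, while 12, 17 and 30 do not.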