[Bug target/100347] [11/12 Regression] GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-05-07 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

--- Comment #15 from Erik Schnetter  ---
When I try to rebuild GCC 10.3 or 10.2, they end up having the same problem.
Also, when I enable bootstrapping, bootstrapping fails with differences in many
files. Given that this used to work on a previous version of the OS, the
problem isn't caused by GCC.

One thing that changed, for example, is that there is now a newer version of
Apple Clang.

Thank you for the help and suggestions.

[Bug target/100347] [11/12 Regression] GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-05-06 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

--- Comment #13 from Erik Schnetter  ---
The failing GCC 11.1.0 is built by Apple Clang 12.0.5 via Spack. Looking at
debug output, I see that Spack inserts a "-march=skylake" command line option.
(I was not aware of this before.) It does so by creating a compiler wrapper
(called "clang++" as well), which calls the actual compiler and adds this (and
some other) flags. 

I seem to recall having read somewhere that GCC's CPU detection code must be
built without any "-march=..." flag.
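
The wrapper behaviour described above can be sketched as follows. This is a
minimal illustration, not how Spack's actual wrapper is written: the function
names wrap_compiler_args and strip_march_flags are hypothetical, and the sketch
only assumes that the wrapper prepends extra flags before the user's arguments.
The second function shows one way a build could drop any "-march=..." flag
before compiling CPU detection code.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a compiler wrapper: build the command line that is
// handed to the real compiler, with the injected flags placed first.
std::vector<std::string> wrap_compiler_args(
    const std::string& real_compiler,
    const std::vector<std::string>& user_args,
    const std::vector<std::string>& injected_flags) {
  std::vector<std::string> cmd;
  cmd.push_back(real_compiler);
  cmd.insert(cmd.end(), injected_flags.begin(), injected_flags.end());
  cmd.insert(cmd.end(), user_args.begin(), user_args.end());
  return cmd;
}

// Hypothetical helper: remove every argument that starts with "-march=".
std::vector<std::string> strip_march_flags(
    const std::vector<std::string>& args) {
  std::vector<std::string> kept;
  for (const std::string& arg : args) {
    // rfind(prefix, 0) == 0 is a prefix test that works before C++20.
    if (arg.rfind("-march=", 0) != 0) kept.push_back(arg);
  }
  return kept;
}
```

Passing the wrapped command line through strip_march_flags yields the command
the user originally asked for, which makes it easy to compare the two in a
build log.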

[Bug target/100347] [11/12 Regression] GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-05-06 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

--- Comment #12 from Erik Schnetter  ---
Yes, GCC 10.3 (built via MacPorts) still works. The sample program reports a
Skylake CPU with both compilers.
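
The sample program itself is not shown in this comment. A minimal sketch of
what such a check might look like, assuming it relies on the x86 built-ins
__builtin_cpu_is and __builtin_cpu_supports that GCC and Clang provide (the
function name describe_host_cpu is hypothetical):

```cpp
#include <string>

// Describe the CPU this program is running on, using the compiler's
// run-time CPU detection built-ins. On non-x86 targets the built-ins are
// not used and a fixed string is returned.
std::string describe_host_cpu() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  __builtin_cpu_init();  // harmless if detection has already run
  std::string desc = __builtin_cpu_is("skylake") ? "skylake" : "other-x86";
  if (__builtin_cpu_supports("avx2")) desc += "+avx2";
  return desc;
#else
  return "non-x86";
#endif
}
```

This queries the CPU at run time from inside the compiled program, which is a
separate code path from the host detection the driver performs when it expands
"-march=native".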

[Bug target/100347] [11/12 Regression] GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-04-30 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

--- Comment #5 from Erik Schnetter  ---
This is my hardware configuration:

$ sysctl -a | grep machdep.cpu
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.size: 256
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.tsc_ccc.numerator: 216
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.brand: 0
machdep.cpu.brand_string: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
machdep.cpu.core_count: 6
machdep.cpu.cores_per_package: 8
machdep.cpu.extfamily: 0
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP
TSCI
machdep.cpu.extmodel: 9
machdep.cpu.family: 6
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA
CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ
DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC
MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_feature_bits: 43806655 1073741824
machdep.cpu.leaf7_feature_bits_edx: 2617255424
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP
BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT SGXLC MDCLEAR
TSXFA IBRS STIBP L1DF SSBD
machdep.cpu.logical_per_package: 16
machdep.cpu.max_basic: 22
machdep.cpu.max_ext: 2147483656
machdep.cpu.microcode_version: 222
machdep.cpu.model: 158
machdep.cpu.processor_flag: 5
machdep.cpu.signature: 591594
machdep.cpu.stepping: 10
machdep.cpu.thread_count: 12
machdep.cpu.vendor: GenuineIntel

[Bug driver/100347] GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-04-29 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

--- Comment #1 from Erik Schnetter  ---
Forgot to add: When I explicitly use "-march=skylake", everything works as
expected.

[Bug driver/100347] New: GCC 11 does not recognize skylake; translates "march=native" to "x86_64"

2021-04-29 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100347

Bug ID: 100347
   Summary: GCC 11 does not recognize skylake; translates
"march=native" to "x86_64"
   Product: gcc
   Version: 11.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: driver
  Assignee: unassigned at gcc dot gnu.org
  Reporter: schnetter at gmail dot com
  Target Milestone: ---

I just built GCC 11.1.0 (via Spack). I find that "-march=native" does not work
any more. It used to work with GCC 10.3 and earlier. The symptom is that
manually vectorized code does not compile any more.

This demonstrates the problem:

$ ./view-compilers/bin/gcc -march=native -Q --help=target | grep march
  -march=   x86-64
  Known valid arguments for -march= option:

This outputs "x86-64" where I expect "skylake".
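
The effect on manually vectorized code can be seen through the predefined
macros. The following is a minimal sketch (the function name
enabled_vector_isa is hypothetical): with "-march=skylake" GCC defines
__AVX2__, whereas the generic x86-64 baseline implies only SSE2, so code
guarded by __AVX2__ or using AVX2 intrinsics stops compiling.

```cpp
#include <string>

// Report the widest x86 vector extension the compiler enabled for this
// translation unit, judged by the predefined feature macros.
std::string enabled_vector_isa() {
#if defined(__AVX512F__)
  return "avx512f";
#elif defined(__AVX2__)
  return "avx2";
#elif defined(__AVX__)
  return "avx";
#elif defined(__SSE2__)
  return "sse2";
#else
  return "none";
#endif
}
```

Compiling this once with "-march=native" and once with "-march=skylake" and
comparing the results gives a quick check that does not depend on parsing the
"--help=target" output.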

$ ./view-compilers/bin/gcc --version
gcc (Spack GCC) 11.1.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It works with an older version of GCC:

$ /opt/local/bin/gcc -march=native -Q --help=target | grep march
  -march=   skylake
  Known valid arguments for -march= option:

$ /opt/local/bin/gcc --version
gcc (MacPorts gcc10 10.3.0_0) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

My system is a

$ uname -a
Darwin redshift.local 20.4.0 Darwin Kernel Version 20.4.0: Fri Mar  5 01:14:14
PST 2021; root:xnu-7195.101.1~3/RELEASE_X86_64 x86_64 i386 MacBookPro15,1
Darwin

[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-27 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #11 from Erik Schnetter  ---
The number of active local variables is likely much larger than the number of
registers, and I expect there to be a lot of spilling. I hope that the compiler
is clever about changing the order in which expressions are evaluated to reduce
spilling as much as possible.

Because the loop is so large, I split it into two, each calculating about half
of the output variables. The code here looks at one of the loops. To simplify
the code, each loop still loads all variables (via masked loads), but may not
use all of them. The unused masked loads do not surprise me per se, but I
expect the compiler to remove them.
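
For readers unfamiliar with masked loads, here is a minimal sketch of the
operation, not taken from the attached source. The function name masked_load4
is hypothetical. When AVX is enabled it uses the _mm256_maskload_pd intrinsic;
otherwise it falls back to a scalar loop with the same result.

```cpp
#include <array>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Load up to four doubles from ptr. Lanes with index >= count are not read
// from memory and are returned as 0.0.
std::array<double, 4> masked_load4(const double* ptr, std::size_t count) {
  std::array<double, 4> out{};
#if defined(__AVX__)
  // A lane is loaded when the high bit of its 64-bit mask element is set.
  const __m256i mask = _mm256_set_epi64x(
      count > 3 ? -1LL : 0, count > 2 ? -1LL : 0,
      count > 1 ? -1LL : 0, count > 0 ? -1LL : 0);
  const __m256d v = _mm256_maskload_pd(ptr, mask);
  _mm256_storeu_pd(out.data(), v);
#else
  for (std::size_t i = 0; i < 4 && i < count; ++i) out[i] = ptr[i];
#endif
  return out;
}
```

If the returned value is never used, the load has no observable effect, which
is why one would expect an optimizer to be able to drop it.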

[Bug tree-optimization/100102] [8/9/10/11 Regression] ICE in tsubst, at cp/pt.c:15310

2021-04-16 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

--- Comment #6 from Erik Schnetter  ---
I looked for the string "GCC" in the user header files, but could not find any
place where things would differ between GCC 10.2 and 10.3. I assume there could
be a difference in GCC-provided header files (the error message mentions
"chrono" and "gcd"), or it could be that nvcc examines the GCC version and
produces different code.
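
Headers can also branch on the compiler version without mentioning the string
"GCC", via the predefined macros __GNUC__, __GNUC_MINOR__ and
__GNUC_PATCHLEVEL__. A minimal sketch of such a check (the function names are
hypothetical, and nothing here is taken from the headers involved in this
bug):

```cpp
// Encode the GNU C version as a single number, e.g. 100300 for 10.3.0.
// Returns 0 when the compiler does not define __GNUC__. Note that Clang
// also defines these macros, so this identifies GCC-compatible compilers.
constexpr long gnuc_version_code() {
#if defined(__GNUC__)
  return __GNUC__ * 10000L + __GNUC_MINOR__ * 100L + __GNUC_PATCHLEVEL__;
#else
  return 0;
#endif
}

// Example of a version gate that would behave differently for 10.2 and 10.3.
constexpr bool is_gnuc_10_3_or_newer() {
  return gnuc_version_code() >= 100300;
}
```

Searching the preprocessed source for these macro names, in addition to the
string "GCC", might narrow down where the two versions diverge.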

[Bug tree-optimization/100102] ICE in tsubst, at cp/pt.c:15310

2021-04-15 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

--- Comment #1 from Erik Schnetter  ---
Created attachment 50605
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50605&action=edit
Compressed preprocessed source code

[Bug tree-optimization/100102] New: ICE in tsubst, at cp/pt.c:15310

2021-04-15 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

Bug ID: 100102
   Summary: ICE in tsubst, at cp/pt.c:15310
   Product: gcc
   Version: 10.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: schnetter at gmail dot com
  Target Milestone: ---

I am using GCC 10.3.0 on x86_64 GNU/Linux. GCC was built via Spack, and is
called from nvcc.

I encounter the following ICE:

cd
/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-build-eiivnj5/Src
&&
/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.3.0/cuda-11.2.2-jbyezwujy3vielujb4xz3izwi6q36jnb/bin/nvcc
-forward-unknown-to-host-compiler
-ccbin=/home/eschnetter/src/CarpetX/Cactus/view-cuda-compilers/bin/g++
-Damrex_EXPORTS
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/Base
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/Boundary
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/AmrCore
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/Amr
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/LinearSolvers/MLMG
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/LinearSolvers/Projections
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/Particle
-I/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-build-eiivnj5
-isystem=/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.3.0/openmpi-4.0.5-jl7qr7jpt3fe6z5rdfkgj2n4t5b4xbdn/include
-isystem=/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.3.0/hdf5-1.10.7-gkflrn3su7geakoyly56sqebg2pqa2yr/include
-isystem=/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.3.0/zlib-1.2.11-dd2emzewyp4o4c22f3niqq3dyhjhqkzs/include
-m64 --expt-relaxed-constexpr --expt-extended-lambda
-Wno-deprecated-gpu-targets -gencode=arch=compute_75,code=sm_75
-maxrregcount=255 -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored
--use_fast_math -Xcudafe --display_error_number --Wext-lambda-captures-this
--Werror ext-lambda-captures-this --Werror cross-execution-space-call
--generate-line-info --source-in-ptx -O2 -g -DNDEBUG -Xcompiler=-fPIC
-Xcompiler=-fopenmp -Xcompiler=-Werror=return-type -Xcompiler -pthread
-std=c++14 -MD -MT Src/CMakeFiles/amrex.dir/Base/AMReX_BlockMutex.cpp.o -MF
CMakeFiles/amrex.dir/Base/AMReX_BlockMutex.cpp.o.d -x cu -dc
/tmp/eschnetter/spack-stage/spack-stage-amrex-21.04-eiivnj5bgmpnqg6o7ofgmy4yvdfgxasa/spack-src/Src/Base/AMReX_BlockMutex.cpp
-o CMakeFiles/amrex.dir/Base/AMReX_BlockMutex.cpp.o
/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.1.0/gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/include/c++/10.3.0/chrono:
In substitution of 'template template using __is_harmonic =
std::__bool_constant<(std::ratio<((_Period2::num / std::chrono::duration<_Rep,
_Period>::_S_gcd(_Period2::num, _Period::num)) * (_Period::den /
std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den, _Period::den))),
((_Period2::den / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den,
_Period::den)) * (_Period::num / std::chrono::duration<_Rep,
_Period>::_S_gcd(_Period2::num, _Period::num)))>::den == 1)> [with _Period2 =
_Period2; _Rep = _Rep; _Period = _Period]':
/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.1.0/gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/include/c++/10.3.0/chrono:473:154:
  required from here
/home/eschnetter/src/CarpetX/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-10.1.0/gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/include/c++/10.3.0/chrono:428:27:
internal compiler error: Segmentation fault
  428 |  _S_gcd(intmax_t __m, intmax_t __n) noexcept
  |   ^~
0xc5d6af crash_signal
        /tmp/eschnetter/spack-stage/spack-stage-gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/spack-src/gcc/toplev.c:328
0x754d6d tsubst(tree_node*, tree_node*, int, tree_node*)
        /tmp/eschnetter/spack-stage/spack-stage-gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/spack-src/gcc/cp/pt.c:15310
0x767d76 tsubst_template_args(tree_node*, tree_node*, int, tree_node*)
        /tmp/eschnetter/spack-stage/spack-stage-gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/spack-src/gcc/cp/pt.c:13225
0x760766 tsubst_aggr_type
        /tmp/eschnetter/spack-stage/spack-stage-gcc-10.3.0-74t7ecp2jgn6myrtnrziqo5hg6bncbb4/spack-src/gcc/cp/pt.c:13428
0x76aa5f tsubst_function_decl
   

[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-06 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #5 from Erik Schnetter  ---
As you suggested, the problem is probably not caused by register spills, but by
stores into a struct that are not optimized away. In this case, the respective
struct elements are unused in the code.

I traced the results of the first __builtin_ia32_maskloadpd256:

  _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
  MEM  [(struct mat3 *) + 992B] = _63940;
  _178613 = .FMA (_63940, _64752, _178609);
  MEM  [(struct mat3 *) + 1312B] = _63940;

The respective struct locations (+ 992B, + 1312B) are indeed not used anywhere
else.

The struct is of type z4c_vars. It (and its parent) are defined in lines 279837
to 280818. It is large.

Is there, for example, a parameter I could set to make GCC try harder to avoid
unnecessary stores?

[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-06 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #4 from Erik Schnetter  ---
I build with the compiler options

/Users/eschnett/src/CarpetX/Cactus/view-compilers/bin/g++  -fopenmp -Wall -pipe
-g -march=skylake -std=gnu++17 -O3 -fcx-limited-range -fexcess-precision=fast
-fno-math-errno -fno-rounding-math -fno-signaling-nans
-funsafe-math-optimizations   -c -o configs/sim/build/Z4c/rhs.cxx.o
configs/sim/build/Z4c/rhs.cxx.ii

One of the kernels in question (the one I describe above) is the C++ lambda in
lines 281013 to 281119. The call to the "noinline" function ensures that the
kernel (and surrounding for loops) is compiled as a separate function, which
produces more efficient code. The function "grid.loop_int_device" contains
essentially three nested for loops, and the actual kernel is the C++ lambda in
lines 281015 to 281118.

I'll have a look at -fdump-tree-optimized.

[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-04 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #2 from Erik Schnetter  ---
I did not describe the scale of the issue. There are more than just a few
inefficient or unnecessary operations:

The loop kernel (a single basic block) extends from address 0x1240 to 0xbf27 in
the attached disassembled object file.

Out of about 6000 instructions in the loop, 1000 are inefficient (and likely
superfluous) moves that copy one 32-byte stack slot into another, using 16-byte
wide copies.

For example, the stack slot 9376(%rsp) is written 9 times in the loop kernel,
but is read only once.
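
Counts like these can be gathered mechanically from the disassembly. The
following is a rough sketch, assuming AT&T syntax where the destination is the
last operand; the names SlotAccesses and count_slot_accesses are hypothetical,
and the heuristic treats every instruction whose last operand is the slot as a
write and every other mention as a read.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct SlotAccesses {
  int writes = 0;
  int reads = 0;
};

// Count how often a stack slot such as "9376(%rsp)" appears as the
// destination (last operand) or as a source in AT&T-syntax disassembly.
SlotAccesses count_slot_accesses(const std::vector<std::string>& lines,
                                 const std::string& slot) {
  SlotAccesses result;
  for (const std::string& line : lines) {
    // Find the slot, skipping matches inside a longer offset ("19376(%rsp)")
    // or a negative offset ("-9376(%rsp)").
    std::size_t pos = line.find(slot);
    while (pos != std::string::npos && pos > 0 &&
           (std::isdigit(static_cast<unsigned char>(line[pos - 1])) ||
            line[pos - 1] == '-')) {
      pos = line.find(slot, pos + 1);
    }
    if (pos == std::string::npos) continue;

    // Locate the last comma that is not inside parentheses; the operand
    // after it is the destination.
    int depth = 0;
    std::size_t last_comma = std::string::npos;
    for (std::size_t i = 0; i < line.size(); ++i) {
      if (line[i] == '(') ++depth;
      else if (line[i] == ')') --depth;
      else if (line[i] == ',' && depth == 0) last_comma = i;
    }
    if (last_comma != std::string::npos && pos > last_comma) ++result.writes;
    else ++result.reads;
  }
  return result;
}
```

Read-modify-write instructions with a single memory operand are counted as
reads by this heuristic, so the numbers are only a first approximation.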

[Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-04 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #1 from Erik Schnetter  ---
Created attachment 50508
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50508&action=edit
Compressed disassembled object file

[Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers

2021-04-04 Thread schnetter at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

Bug ID: 99912
   Summary: Unnecessary / inefficient spilling of AVX2 ymm
registers
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: schnetter at gmail dot com
  Target Milestone: ---

Created attachment 50507
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50507&action=edit
Compressed preprocessed source code

I am using "g++ (Spack GCC) 11.0.1 20210404 (experimental)" (fresh checkout) on
macOS 11.2.3 with an x86-64 Skylake CPU.

I am manually SIMD-vectorizing a loop kernel using AVX2 intrinsics. The
generated code is correct, but has obvious inefficiencies. I find these issues:

1. There are spills (?) of AVX2 ymm registers that are overwritten by another
spill a few instructions later, without being read in the meantime

2. The same register is spilled into multiple stack slots in consecutive
instructions

3. After spilling a ymm register, the stack slot is copied to another stack
slot, using xmm registers (i.e. using two loads/stores)

I tried to reproduce the issue in a small example, but failed. If this issue is
really due to spilling, then it might not be possible to have a small test
case.



Here is an example of issues 1 and 2; I show a few lines from the attached
disassembled file to clarify:
{{{
1520: c5 fd 29 8c 24 a0 24 00 00  vmovapd %ymm1, 9376(%rsp)
1529: c5 fd 29 8c 24 20 29 00 00  vmovapd %ymm1, 10528(%rsp)
1532: c5 fd 29 b4 24 80 28 00 00  vmovapd %ymm6, 10368(%rsp)
153b: c5 fd 29 ac 24 a0 28 00 00  vmovapd %ymm5, 10400(%rsp)
1544: c5 fd 29 a4 24 c0 28 00 00  vmovapd %ymm4, 10432(%rsp)
154d: c5 fd 29 9c 24 e0 28 00 00  vmovapd %ymm3, 10464(%rsp)
1556: c5 fd 29 94 24 00 29 00 00  vmovapd %ymm2, 10496(%rsp)
155f: c4 a2 1d 2d 34 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm6
1565: 48 8b 84 24 00 05 00 00     movq 1280(%rsp), %rax
156d: c5 fd 29 b4 24 00 24 00 00  vmovapd %ymm6, 9216(%rsp)
1576: c4 a2 1d 2d 2c 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm5
157c: 48 8b 84 24 38 07 00 00     movq 1848(%rsp), %rax
1584: c5 fd 29 ac 24 20 24 00 00  vmovapd %ymm5, 9248(%rsp)
158d: c4 a2 1d 2d 24 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm4
1593: 48 8b 84 24 60 04 00 00     movq 1120(%rsp), %rax
159b: c5 fd 29 a4 24 40 24 00 00  vmovapd %ymm4, 9280(%rsp)
15a4: c4 a2 1d 2d 1c 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm3
15aa: 48 8b 84 24 68 04 00 00     movq 1128(%rsp), %rax
15b2: c5 fd 29 9c 24 60 24 00 00  vmovapd %ymm3, 9312(%rsp)
15bb: c4 a2 1d 2d 14 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm2
15c1: c5 fd 29 94 24 80 24 00 00  vmovapd %ymm2, 9344(%rsp)
15ca: 48 8b 84 24 08 05 00 00     movq 1288(%rsp), %rax
15d2: c4 a2 1d 2d 0c 30           vmaskmovpd (%rax,%r14), %ymm12, %ymm1
15d8: 48 8b 84 24 70 04 00 00     movq 1136(%rsp), %rax
15e0: c5 fd 29 8c 24 a0 24 00 00  vmovapd %ymm1, 9376(%rsp)
15e9: c5 fd 29 b4 24 40 29 00 00  vmovapd %ymm6, 10560(%rsp)
15f2: c5 fd 29 ac 24 60 29 00 00  vmovapd %ymm5, 10592(%rsp)
15fb: c5 fd 29 a4 24 80 29 00 00  vmovapd %ymm4, 10624(%rsp)
1604: c5 fd 29 9c 24 a0 29 00 00  vmovapd %ymm3, 10656(%rsp)
160d: c5 fd 29 94 24 c0 29 00 00  vmovapd %ymm2, 10688(%rsp)
1616: c5 fd 29 8c 24 e0 29 00 00  vmovapd %ymm1, 10720(%rsp)
}}}

The beginning and end of this sample are what I think might be spill
instructions. The instruction at 1520 writes to 9376(%rsp), and the instruction
at 15e0 overwrites this stack slot. Also, the register %ymm1 is written
multiple times to different stack slots. (That by itself could be fine, but it
looks strange.)

A few instructions later I find this code:
{{{
16d7: c5 79 6f 84 24 80 28 00 00  vmovdqa 10368(%rsp), %xmm8
16e0: c5 79 6f ac 24 20 29 00 00  vmovdqa 10528(%rsp), %xmm13
16e9: c5 79 7f 84 24 e0 19 00 00  vmovdqa %xmm8, 6624(%rsp)
16f2: c5 79 6f 84 24 90 28 00 00  vmovdqa 10384(%rsp), %xmm8
16fb: c5 79 7f ac 24 80 1a 00 00  vmovdqa %xmm13, 6784(%rsp)
1704: c5 79 7f 84 24 f0 19 00 00  vmovdqa %xmm8, 6640(%rsp)
}}}
This copies the 32 bytes at 10368(%rsp) (written above), but uses %xmm8 to copy
the stack slot in 16-byte chunks. This shouldn't happen; there is no reason to
copy from one stack slot to another (presumably, since I know the code, but I
could be mistaken here). There is also no reason to copy in 16-byte chunks.
(All relevant local variables are ultimately of type __m256d, wrapped in C++
structs, and should thus be correctly aligned.)
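
The alignment expectation can be checked at compile time. A minimal sketch,
assuming a wrapper struct around a single __m256d; the names vec4d and
small_tensor are hypothetical and not the types used in the attached source.
When AVX is not enabled, an equally sized and aligned array stands in for
__m256d so that the checks still compile.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical wrapper around one 256-bit vector of four doubles.
struct vec4d {
#if defined(__AVX__)
  __m256d v;
#else
  alignas(32) double v[4];  // stand-in with the same size and alignment
#endif
};

// Hypothetical aggregate of wrapped vectors, e.g. six tensor components.
struct small_tensor {
  vec4d elts[6];
};

static_assert(sizeof(vec4d) == 32, "one vector is 32 bytes");
static_assert(alignof(vec4d) == 32, "wrapper keeps 32-byte alignment");
static_assert(alignof(small_tensor) == 32, "aggregate inherits alignment");
static_assert(sizeof(small_tensor) % 32 == 0, "no misaligning padding");
```

If these assertions hold for the real types, 32-byte aligned moves would be
valid for every element, which is what makes the 16-byte copies look
surprising.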



To give some background information: The loop is quite large; it is part of a
complex numerical kernel for the Einstein equations