https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641
Bug ID: 103641
Summary: [aarch64][11 regression] Severe compile time regression in SLP vectorize step
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: husseydevin at gmail dot com
Target Milestone: ---

Created attachment 51966
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit
aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c -ftime-report -ftime-report-details

While GCC 11.2 has been noticeably better at NEON64 code, with some files it hangs for 15-30 seconds or more on the SLP vectorization step.

I haven't narrowed this down to a specific thing yet because I don't know much about the GCC internals, but it is *extremely* noticeable in the xxHash library (https://github.com/Cyan4973/xxHash).

This is a test compiling xxhash.c from Git revision a17161efb1d2de151857277628678b0e0b486155.

This was done on a Core i5-430m with 8GB RAM and an SSD on Debian Bullseye amd64. GCC 10 (10.2.1-6) was from the repos, GCC 11 (11.2.0) was built from the tarball with similar flags. While this may cause bias, the two compilers get very similar times when the SLP vectorizer is off.

$ time aarch64-linux-gnu-gcc-10 -O3 -c xxhash.c

real    0m3.596s
user    0m3.270s
sys     0m0.149s

$ time aarch64-linux-gnu-gcc-11 -O3 -c xxhash.c

real    0m31.579s
user    0m31.314s
sys     0m0.112s

When disabling the NEON intrinsics with `-DXXH_VECTOR=0`, it only takes ~21 seconds.

Time variable                         usr            sys           wall            GGC
 phase opt and generate        :  31.46 ( 97%)   0.24 ( 32%)  31.80 ( 96%)    54M ( 63%)
 callgraph functions expansion :  31.01 ( 96%)   0.18 ( 24%)  31.29 ( 94%)    42M ( 49%)
 tree slp vectorization        :  28.35 ( 88%)   0.03 (  4%)  28.37 ( 85%)  9941k ( 11%)
 TOTAL                         :  32.34          0.75         33.20           86M

This is significantly worse on my Pi 4B, where an ARMv7->AArch64 build took 3 minutes, although I presume that is mostly due to being 32-bit and the CPU being much slower.
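
The comparison with the SLP vectorizer on and off could be reproduced with something like the sketch below. It assumes `-fno-tree-slp-vectorize` is how the pass was turned off (the report does not name the flag used), that both cross compilers are on `PATH` under the names shown in the timings, and that xxhash.c is in the current directory. The `SLP_OFF` variable is only an illustrative name.

```shell
#!/bin/bash
# Sketch only: time each cross compiler with and without SLP vectorization.
# -fno-tree-slp-vectorize is a standard GCC option; that the reporter used
# exactly this flag is an assumption.
SLP_OFF=-fno-tree-slp-vectorize

for cc in aarch64-linux-gnu-gcc-10 aarch64-linux-gnu-gcc-11; do
    # Skip compilers that are not installed on this machine.
    if ! command -v "$cc" >/dev/null 2>&1; then
        echo "skip: $cc not found"
        continue
    fi
    for flag in "" "$SLP_OFF"; do
        echo "== $cc -O3 $flag -c xxhash.c"
        # $flag is deliberately unquoted so the empty case adds no argument.
        time "$cc" -O3 $flag -c xxhash.c
    done
done
```

If the slowdown is confined to this pass, the GCC 11 timing with the flag should fall back close to the GCC 10 figure, and adding `-ftime-report` to that run should no longer show a dominant "tree slp vectorization" row.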