https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892
            Bug ID: 107892
           Summary: Unnecessary move between ymm registers in loop using
                    AVX2 intrinsic
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64
at -O2:

    #include <immintrin.h>

    int __attribute__((target("avx2")))
    sum_ints(const __m256i *p, size_t n)
    {
            __m256i a = _mm256_setzero_si256();
            __m128i b;

            do {
                    a = _mm256_add_epi32(a, *p++);
            } while (--n);

            b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
                              _mm256_extracti128_si256(a, 1));
            b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
            b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
            return _mm_cvtsi128_si32(b);
    }

The assembly that gcc generates is:

    0000000000000000 <sum_ints>:
       0:   c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
       4:   0f 1f 40 00             nopl   0x0(%rax)
       8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
       c:   48 83 c7 20             add    $0x20,%rdi
      10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
      14:   48 83 ee 01             sub    $0x1,%rsi
      18:   75 ee                   jne    8 <sum_ints+0x8>
      1a:   c4 e3 7d 39 c1 01       vextracti128 $0x1,%ymm0,%xmm1
      20:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
      24:   c5 f9 70 c8 31          vpshufd $0x31,%xmm0,%xmm1
      29:   c5 f1 fe c8             vpaddd %xmm0,%xmm1,%xmm1
      2d:   c5 f9 70 c1 02          vpshufd $0x2,%xmm1,%xmm0
      32:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
      36:   c5 f9 7e c0             vmovd  %xmm0,%eax
      3a:   c5 f8 77                vzeroupper
      3d:   c3                      ret

The bug is that the inner loop contains an unnecessary vmovdqa:

       8:   vpaddd (%rdi),%ymm1,%ymm0
            add    $0x20,%rdi
            vmovdqa %ymm0,%ymm1
            sub    $0x1,%rsi
            jne    8 <sum_ints+0x8>

It should look like the following instead:

       8:   vpaddd (%rdi),%ymm0,%ymm0
            add    $0x20,%rdi
            sub    $0x1,%rsi
            jne    8 <sum_ints+0x8>

Strangely, the bug goes away if the __v8si type is used instead of __m256i
and the addition is done using "+=" instead of _mm256_add_epi32():

    int __attribute__((target("avx2")))
    sum_ints_good(const __v8si *p, size_t n)
    {
            __v8si a = {};
            __m128i b;

            do {
                    a += *p++;
            } while (--n);

            b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
                              _mm256_extracti128_si256((__m256i)a, 1));
            b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
            b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
            return _mm_cvtsi128_si32(b);
    }

In the bad version, I noticed that the RTL initially has two separate insns
for 'a += *p': one that does the addition and writes the result to a new
pseudo register, and one that converts the value from mode V8SI to V4DI and
assigns it back to the original pseudo register. These two insns never get
combined. (That partly explains why the bug isn't seen with the __v8si and
"+=" method: gcc doesn't do a type conversion with that method.) So I'm
wondering whether the bug is in the instruction combining pass, or whether
the RTL should never have had two separate insns in the first place.
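
To illustrate, the two insns look roughly like the following. This is a
hand-written sketch of the shape of the pre-combine RTL, not an actual
dump; the pseudo register numbers and the mem operand are made up:

    ;; Sketch only: the addition is done in mode V8SI and its result is
    ;; written to a fresh pseudo register...
    (set (reg:V8SI 86)
         (plus:V8SI (subreg:V8SI (reg:V4DI 85 [ a ]) 0)
                    (mem:V8SI (reg:DI 84 [ p ]))))

    ;; ...and a second insn copies it back, via a mode-changing subreg,
    ;; to the pseudo holding 'a', which has mode V4DI.
    (set (reg:V4DI 85 [ a ])
         (subreg:V4DI (reg:V8SI 86) 0))

If these were combined into a single set of reg 85, the register allocator
could presumably tie the input and output operands and the vmovdqa would
disappear.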
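
For reference, the type conversion comes from the intrinsic itself. GCC's
avx2intrin.h defines _mm256_add_epi32() approximately as follows
(paraphrased from memory; the exact definition may differ):

    /* __m256i is a vector of 4 long longs (mode V4DI), but the addition
       is performed element-wise on 8 32-bit elements (mode V8SI), so
       each call implies a V8SI -> V4DI conversion of the result.  */
    extern __inline __m256i
    __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
    _mm256_add_epi32 (__m256i __A, __m256i __B)
    {
      return (__m256i) ((__v8su) __A + (__v8su) __B);
    }

The __v8si version of the function avoids the casts entirely, so 'a' stays
in mode V8SI throughout and no conversion insn is generated.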