https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #1 from Eric Biggers <ebiggers3 at gmail dot com> ---
The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk.  I can
do a bisection if needed.

However, I still see the bug with gcc-11.1.0 on my original unminimized
code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142
so the reproducer I gave may not be the best one.  Here is a slightly
different reproducer that triggers the bug with both gcc-11.1.0 and trunk:

        #include <stddef.h>
        #include <immintrin.h>

        __m256i __attribute__((target("avx2")))
        f(const __m256i *p, size_t n)
        {
                __m256i a = _mm256_setzero_si256();

                do {
                        a = _mm256_add_epi32(a, *p++);
                } while (--n);

                return _mm256_madd_epi16(a, a);
        }

The assembly of the loop has an unnecessary vmovdqa (at offset 0x10) that
copies the sum from ymm0 back to ymm1 on every iteration:

   8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20             add    $0x20,%rdi
  10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01             sub    $0x1,%rsi
  18:   75 ee                   jne    8 <f+0x8>
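
For readers less familiar with the intrinsics, here is a rough scalar sketch
of what f() in the reproducer computes.  It is only an illustration, not part
of the original report: the name f_scalar, the flat uint32_t input layout
(8 words per 256-bit vector) and the out parameter are assumptions made for
this sketch.  Like the original, it assumes n >= 1.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar model of f(): p points to n vectors of 8 32-bit lanes.
 * Not from the bug report; the name and signature are illustrative only.
 */
void f_scalar(const uint32_t *p, size_t n, int32_t out[8])
{
        uint32_t a[8] = {0};    /* models _mm256_setzero_si256() */

        do {
                /* models _mm256_add_epi32(a, *p++): lane-wise wrapping add */
                for (int i = 0; i < 8; i++)
                        a[i] += p[i];
                p += 8;
        } while (--n);

        /*
         * models _mm256_madd_epi16(a, a): each 32-bit lane is split into two
         * signed 16-bit halves, each half is squared, and the two products
         * are added with 32-bit wraparound.
         */
        for (int i = 0; i < 8; i++) {
                int16_t lo = (int16_t)(a[i] & 0xffff);
                int16_t hi = (int16_t)(a[i] >> 16);

                out[i] = (int32_t)((uint32_t)(lo * lo) + (uint32_t)(hi * hi));
        }
}
```

The loop-carried value here is the accumulator a, which is what lives in
ymm0/ymm1 in the disassembly above.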
