https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93594

            Bug ID: 93594
           Summary: Missed optimization with _mm256_set/setr_m128i
                    intrinsics
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

When the _mm256_set_m128i/_mm256_setr_m128i intrinsics are used to zero the
upper half of the resulting register, gcc generates an unnecessary vinserti128
instruction where a single vmovdqa would be enough, since a VEX-encoded 128-bit
move already zeroes the upper half of the destination ymm register. The compiler
is able to recognize the
"_mm256_insertf128_si256(_mm256_setzero_si256(), low, 0)" pattern but not
"_mm256_insertf128_si256(_mm256_castsi128_si256(low), _mm_setzero_si128(), 1)".

You can see code generated for the different pieces of code here:
https://gcc.godbolt.org/z/ZMwtPq

Note that clang is able to recognize all versions and generates optimal code in
all cases.

For convenience, here is the test code:

#include <immintrin.h>

__m256i cvt_setr(__m128i low)
{
    return _mm256_setr_m128i(low, _mm_setzero_si128());
}

__m256i cvt_set(__m128i low)
{
    return _mm256_set_m128i(_mm_setzero_si128(), low);
}

__m256i cvt_insert(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_setzero_si256(), low, 0);
}

__m256i cvt_insert_v2(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_castsi128_si256(low),
                                   _mm_setzero_si128(), 1);
}

$ g++ -O3 -mavx2
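
To back up the claim that all four functions compute the same value (the low 128 bits copied, the upper 128 bits zero), here is a minimal runtime sketch. It only checks the results, not the generated instructions. The names AVX_FN, run_variants and variants_zero_extend are made up for this illustration and are not part of the original test case. The code assumes GCC or clang, because it relies on the target("avx") function attribute and __builtin_cpu_supports so that it builds without -mavx2 on the command line.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper macro: enables AVX for one function (GCC/clang only).
#define AVX_FN __attribute__((target("avx")))

// The four variants from the report, unchanged apart from the attribute.
AVX_FN static __m256i cvt_setr(__m128i low)
{
    return _mm256_setr_m128i(low, _mm_setzero_si128());
}

AVX_FN static __m256i cvt_set(__m128i low)
{
    return _mm256_set_m128i(_mm_setzero_si128(), low);
}

AVX_FN static __m256i cvt_insert(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_setzero_si256(), low, 0);
}

AVX_FN static __m256i cvt_insert_v2(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_castsi128_si256(low),
                                   _mm_setzero_si128(), 1);
}

// Hypothetical helper: runs every variant on the same 16 input bytes and
// stores each 32-byte result through memory, so no vector type crosses a
// call boundary from code compiled without AVX.
AVX_FN static void run_variants(const std::uint8_t* in16,
                                std::uint8_t out[4][32])
{
    __m128i low = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in16));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out[0]), cvt_setr(low));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out[1]), cvt_set(low));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out[2]), cvt_insert(low));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out[3]), cvt_insert_v2(low));
}

// Hypothetical helper: returns true if every variant yields the input in the
// low 16 bytes and zeros in the high 16 bytes. On a CPU without AVX the
// check cannot run, so it returns true without testing anything.
bool variants_zero_extend()
{
    if (!__builtin_cpu_supports("avx"))
        return true;

    std::uint8_t in[16];
    for (int i = 0; i < 16; ++i)
        in[i] = static_cast<std::uint8_t>(i + 1);

    std::uint8_t expected[32] = {};   // upper 16 bytes stay zero
    std::memcpy(expected, in, 16);

    std::uint8_t out[4][32];
    run_variants(in, out);

    for (int v = 0; v < 4; ++v)
        if (std::memcmp(out[v], expected, 32) != 0)
            return false;
    return true;
}
```

Calling variants_zero_extend() from main and checking that it returns true confirms the functions are interchangeable, so the difference in the generated code is purely a missed optimization.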
