https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93594
            Bug ID: 93594
           Summary: Missed optimization with _mm256_set/setr_m128i intrinsics
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

When the _mm256_set_m128i/_mm256_setr_m128i intrinsics are used to zero the
upper half of the resulting register, gcc generates an unnecessary vinserti128
instruction where a single vmovdqa would be enough. The compiler is able to
recognize the "_mm256_insertf128_si256(_mm256_setzero_si256(), low, 0)" pattern
but not "_mm256_insertf128_si256(_mm256_castsi128_si256(low),
_mm_setzero_si128(), 1)".

You can see the code generated for the different variants here:

https://gcc.godbolt.org/z/ZMwtPq

Note that clang recognizes all of the variants and generates optimal code in
every case.

For convenience, here is the test code:

#include <immintrin.h>

__m256i cvt_setr(__m128i low)
{
    return _mm256_setr_m128i(low, _mm_setzero_si128());
}

__m256i cvt_set(__m128i low)
{
    return _mm256_set_m128i(_mm_setzero_si128(), low);
}

__m256i cvt_insert(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_setzero_si256(), low, 0);
}

__m256i cvt_insert_v2(__m128i low)
{
    return _mm256_insertf128_si256(_mm256_castsi128_si256(low),
                                   _mm_setzero_si128(), 1);
}

$ g++ -O3 -mavx2
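
To make the equivalence of the four variants concrete, here is a minimal sketch (not part of the original test case) that checks that each one returns `low` in bits 127:0 and zeros in bits 255:128. It assumes GCC or Clang on x86-64 and a CPU with AVX. The `AVX_FN` macro and the helpers `is_zero_extended` and `variants_agree` are hypothetical names introduced only for this illustration, and the `target("avx")` attribute is there only so the file builds without `-mavx2` on the command line.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Assumption: GCC/Clang function-level target attribute, used here so the
// sketch compiles without -mavx/-mavx2.
#define AVX_FN __attribute__((target("avx")))

// The four variants from the report, unchanged apart from the attribute.
AVX_FN static __m256i cvt_setr(__m128i low) {
    return _mm256_setr_m128i(low, _mm_setzero_si128());
}
AVX_FN static __m256i cvt_set(__m128i low) {
    return _mm256_set_m128i(_mm_setzero_si128(), low);
}
AVX_FN static __m256i cvt_insert(__m128i low) {
    return _mm256_insertf128_si256(_mm256_setzero_si256(), low, 0);
}
AVX_FN static __m256i cvt_insert_v2(__m128i low) {
    return _mm256_insertf128_si256(_mm256_castsi128_si256(low),
                                   _mm_setzero_si128(), 1);
}

// Hypothetical helper: true if v holds `low` in its lower 128 bits and
// zeros in its upper 128 bits.
AVX_FN static bool is_zero_extended(__m256i v, const std::uint32_t (&low)[4]) {
    std::uint32_t out[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
    const std::uint32_t zeros[4] = {0, 0, 0, 0};
    return std::memcmp(out, low, sizeof(zeros)) == 0 &&
           std::memcmp(out + 4, zeros, sizeof(zeros)) == 0;
}

// Hypothetical driver: runs all four variants on the same input and checks
// that each produces the zero-extended value.
AVX_FN bool variants_agree() {
    const std::uint32_t low_words[4] = {0x11111111u, 0x22222222u,
                                        0x33333333u, 0x44444444u};
    const __m128i low =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(low_words));
    return is_zero_extended(cvt_setr(low), low_words) &&
           is_zero_extended(cvt_set(low), low_words) &&
           is_zero_extended(cvt_insert(low), low_words) &&
           is_zero_extended(cvt_insert_v2(low), low_words);
}
```

A VEX-encoded 128-bit move already zeroes the upper half of the destination ymm register, which is why all four functions can in principle compile to the single vmovdqa mentioned above.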