https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368
Bug ID: 119368
Summary: immintrin code running slower with gcc than clang
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
As mentioned in
https://www.root.cz/clanky/instrukcni-sady-simd-a-automaticke-vektorizace-provadene-prekladacem-gcc/nazory/#newIndex1
the following code runs faster when compiled by Clang:
#include <immintrin.h>
#include <cstddef>

int product(const char *a, const char *b)
{
    __m512i sum = _mm512_setzero_si512();
    for (size_t i = 0; i < 256; i += 64)
    {
        __m512i la = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&a[i]));
        __m512i lb = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&b[i]));
        __m512i a_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(la));
        __m512i b_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(lb));
        __m512i mul_low = _mm512_madd_epi16(a_low, b_low);
        __m512i a_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(la, 1));
        __m512i b_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(lb, 1));
        __m512i mul_high = _mm512_madd_epi16(a_high, b_high);
        sum = _mm512_add_epi32(sum, mul_low);
        sum = _mm512_add_epi32(sum, mul_high);
    }
    return _mm512_reduce_add_epi32(sum);
}
https://godbolt.org/z/d4oE11red
The slowdown is due to GCC splitting the 512-bit loads into 256-bit loads (folded into vpmovsxbw).
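For reference, the intrinsic loop above computes the signed dot product of two 256-byte buffers (vpmovsxbw sign-extends each byte to 16 bits, vpmaddwd multiplies and pairwise-adds into 32-bit lanes). A scalar equivalent, useful for checking either compiler's output (the name product_scalar is mine, not part of the report), can be sketched as:

```cpp
#include <cstddef>

// Scalar reference for the vectorized product(): each byte is
// sign-extended to int, the pairs are multiplied, and the products are
// accumulated in a 32-bit sum, matching the vpmovsxbw + vpmaddwd +
// vpaddd sequence above.
int product_scalar(const char *a, const char *b)
{
    int sum = 0;
    for (size_t i = 0; i < 256; ++i)
        sum += static_cast<int>(static_cast<signed char>(a[i]))
             * static_cast<int>(static_cast<signed char>(b[i]));
    return sum;
}
```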