On Tue, Jun 17, 2025 at 6:40 AM Andy Fan <zhihuifan1...@163.com> wrote:
>
> "Devulapalli, Raghuveer" <raghuveer.devulapa...@intel.com> writes:
>
> > Great catch! From the intrinsic manual:
> >
> > Cast vector of type __m128i to type __m512i; the upper 384 bits of the
> > result are undefined.

Thanks Raghuveer and Nathan, for the diagnosis!

> Just be curious, what kind of optimization (like what -O2 does) could
> mask this issue?

In case Andy is asking about "how" rather than "under what
circumstances", my guess is: -O1+ may have just chosen instructions
that also happen to zero-extend, which are common. -O0 doesn't
represent the naive straightforward structure of what the programmer
wrote; it's more like an "exploded" representation suitable for later
optimization passes. That's why it always looks goofy.

> > Replacing that with _mm512_zextsi128_si512 fixes the problem.

Here's a patch for testing, which also reverts the previous
workaround. Help welcome, but I still promise to test it in the near
future regardless.

--
John Naylor
Amazon Web Services
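
For anyone following along, the cast/zext distinction can be sketched
with a plain-C model. This is only an illustration, not the real
intrinsics: vec512_model, model_cast, model_zext, model_xor,
model_upper_lanes_equal and model_demo are hypothetical names, and the
"stale" argument is an assumption standing in for whatever an undefined
upper 384 bits might happen to contain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 512-bit register: eight 64-bit lanes. */
typedef struct
{
	uint64_t	lane[8];
} vec512_model;

/*
 * Models the contract of _mm512_castsi128_si512: only the low 128 bits
 * (lanes 0 and 1) are specified. The upper lanes are modeled as keeping
 * whatever "stale" held, since the manual calls them undefined.
 */
vec512_model
model_cast(uint64_t lo, uint64_t hi, vec512_model stale)
{
	stale.lane[0] = lo;
	stale.lane[1] = hi;
	return stale;
}

/* Models _mm512_zextsi128_si512: the upper 384 bits are always zero. */
vec512_model
model_zext(uint64_t lo, uint64_t hi)
{
	vec512_model v;

	memset(&v, 0, sizeof(v));
	v.lane[0] = lo;
	v.lane[1] = hi;
	return v;
}

/* Lane-wise XOR, standing in for _mm512_xor_si512. */
vec512_model
model_xor(vec512_model a, vec512_model b)
{
	for (int i = 0; i < 8; i++)
		a.lane[i] ^= b.lane[i];
	return a;
}

/* Returns 1 if lanes 2..7 (the upper 384 bits) match, else 0. */
int
model_upper_lanes_equal(vec512_model a, vec512_model b)
{
	for (int i = 2; i < 8; i++)
		if (a.lane[i] != b.lane[i])
			return 0;
	return 1;
}

/* Folding a CRC into the first block, as the patched line does. */
void
model_demo(void)
{
	vec512_model x0 = {{1, 2, 3, 4, 5, 6, 7, 8}};		/* input data */
	vec512_model stale = {{0, 0, 0xdead, 0xbeef, 0, 0, 0, 0}};
	uint64_t	crc0 = 0xffffffffu;
	vec512_model good = model_xor(model_zext(crc0, 0), x0);
	vec512_model bad = model_xor(model_cast(crc0, 0, stale), x0);

	/* Zero-extension leaves the upper data lanes untouched. */
	assert(model_upper_lanes_equal(good, x0));
	/* Leftover upper bits get XORed into the data and corrupt it. */
	assert(!model_upper_lanes_equal(bad, x0));
	/* The low lane is the same either way. */
	assert(good.lane[0] == bad.lane[0]);
}
```

In this model the two variants differ only when "stale" is nonzero in
lanes 2..7, which fits the guess above that optimized builds may simply
have picked instructions that zero the upper bits anyway.
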

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 9af3474a6ca..1a717255355 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -123,7 +123,7 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len)
 		__m512i		k;
 
 		k = _mm512_broadcast_i32x4(_mm_setr_epi32(0x740eef02, 0, 0x9e4addf8, 0));
-		x0 = _mm512_xor_si512(_mm512_castsi128_si512(_mm_cvtsi32_si128(crc0)), x0);
+		x0 = _mm512_xor_si512(_mm512_zextsi128_si512(_mm_cvtsi32_si128(crc0)), x0);
 		buf += 64;
 
 		/* Main loop. */
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 802e47788c1..74d2421ba2b 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -95,9 +95,7 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	__cpuidex(exx, 7, 0);
 #endif
 
-#if defined(__clang__) && !defined(__OPTIMIZE__)
-	/* Some versions of clang are broken at -O0 */
-#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
 	if (exx[2] & (1 << 10) &&	/* VPCLMULQDQ */
 		exx[1] & (1 << 31))		/* AVX512-VL */
 		pg_comp_crc32c = pg_comp_crc32c_avx512;
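
The runtime check that the second hunk re-enables for all compilers can
be sketched as a standalone predicate. This is a minimal sketch, not the
code in pg_crc32c_sse42_choose.c: leaf7_allows_avx512_crc and the two
macro names are hypothetical, the exx[] layout (EAX, EBX, ECX, EDX from
CPUID leaf 7, subleaf 0) is assumed from the comments in the patch, and
any other checks the real function performs are left out.

```c
#include <assert.h>

/* CPUID leaf 7, subleaf 0 feature bits named in the patch comments. */
#define CPUID7_ECX_VPCLMULQDQ	(1u << 10)
#define CPUID7_EBX_AVX512VL		(1u << 31)

/*
 * Returns 1 if both feature bits tested by the patch are set, else 0.
 * Assumed layout: exx[0]=EAX, exx[1]=EBX, exx[2]=ECX, exx[3]=EDX.
 */
int
leaf7_allows_avx512_crc(const unsigned int exx[4])
{
	return (exx[2] & CPUID7_ECX_VPCLMULQDQ) != 0 &&
		(exx[1] & CPUID7_EBX_AVX512VL) != 0;
}
```

The sketch uses unsigned constants because shifting a signed 1 into
bit 31 is not well defined in C. Passing a hand-built array such as
{0, 1u << 31, 1u << 10, 0} should return 1, while clearing either bit
should return 0.
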