https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think it is not "splitting [...] to 256-bit loads" but rather splitting an
SSE4.1 extending load into full-width load followed by extension of the
just-loaded vector (throwing away half of the vector). The corresponding
intrinsics and built-ins are completely misdesigned, as they require writing
code as if a full vector were loaded from memory:
#include <immintrin.h>
__m128i f(__m128i *x)
{
  return _mm_cvtepi16_epi32(*x);
}
Compiled with -O2 -msse4.1, this minimal testcase demonstrates the fundamental
issue. LLVM manages to fold the load, producing pmovsxwd with a memory operand
(but with better designed intrinsics the effort on the compiler side would be
smaller).
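
For comparison, here is a sketch of how the narrow load can be spelled out
by hand, so the source only touches the 8 bytes that the extension actually
consumes. The helper name widen4_i16_to_i32 is hypothetical and not part of
this report; the scalar branch is only there so the snippet also builds
without -msse4.1. Whether a given compiler folds the 64-bit load into a
pmovsxwd memory operand is not guaranteed.

```c
#include <stdint.h>
#if defined(__SSE4_1__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the report): sign-extend the four int16_t
   values at src into four int32_t values at dst. */
void widen4_i16_to_i32(const int16_t *src, int32_t *dst)
{
#if defined(__SSE4_1__)
    /* Load only the low 64 bits; the upper half of the register is zeroed
       and then ignored by the extension. */
    __m128i lo = _mm_loadl_epi64((const __m128i *)src);
    /* Sign-extend the four low 16-bit lanes to 32 bits and store them. */
    _mm_storeu_si128((__m128i *)dst, _mm_cvtepi16_epi32(lo));
#else
    /* Scalar fallback with the same result, used when SSE4.1 is not enabled. */
    for (int i = 0; i < 4; i++)
        dst[i] = src[i];
#endif
}
```

Unlike the testcase above, this version never claims to read a full 16-byte
vector, so it is also safe when only 8 bytes at src are accessible.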