https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

--- Comment #24 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Jeffrey Walton from comment #23)
> (In reply to Peter Cordes from comment #22)
> > [...]
> > That instruction is useless and should never be used in asm except for
> > code-alignment reasons (1 byte longer than MOVLPS, same length as MOVSD, all
> > three doing the same thing for the memory-destination form).  But easy to
> > imagine some code using that intrinsic to store an unaligned double into a
> > byte buffer.
> 
> Reading from and writing to a [unaligned] byte stream in 4 or 8 byte chunks
> is our use case. Eventually, we need to perform traditional SIMD processing.
> But the loads and stores have to occur using these old intrinsics due to
> the word types, data stream format and supported ISAs.
> 
> I believe the other option is to memcpy the byte stream into a properly
> aligned intermediate buffer. But that could incur a performance hit if the
> optimizer misses the opportunity (and fails to elide the memcpy).


Apparently GCC has been "broken" for ages: it was UB to use misaligned
pointers with any of the intrinsics that only just now had their alignment
requirements removed, and it still is with _mm_storel_pd, which is the same
as before.  In practice this usually didn't result in miscompilation, though.

Going forward, simply avoid _mm_storel_pd.
Use _mm_store_sd (MOVSD), which has been fixed by this patch, or
_mm_storel_pi (MOVLPS).

_mm_store_sd derefs a double_u pointer, i.e. a double with
__attribute__((aligned(1),may_alias)).
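
As a rough sketch of that option (store_low_sd and demo_store_low_sd are
names I made up for illustration, and this assumes a GCC that includes the
fix from this patch, so that _mm_store_sd no longer assumes an aligned
double):

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_store_sd, _mm_set_pd */
#include <string.h>     /* memcpy */

/* Hypothetical helper (not part of GCC or this patch): write the low
   double of v to dst, which may have any alignment.  The cast is only
   there because the intrinsic's signature takes a double *. */
static inline void store_low_sd(void *dst, __m128d v)
{
    _mm_store_sd((double *)dst, v);
}

/* Store at an odd offset into a byte buffer, then read the value back
   with memcpy so the check itself has no alignment or aliasing issues. */
static double demo_store_low_sd(void)
{
    unsigned char buf[16] = {0};
    double out;

    store_low_sd(buf + 1, _mm_set_pd(2.0, 1.0));  /* high = 2.0, low = 1.0 */
    memcpy(&out, buf + 1, sizeof out);
    return out;
}
```

With older headers the same source is still technically UB for a misaligned
dst, even if it usually happens to compile to the expected MOVSD store.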

_mm_storel_pi uses __builtin_ia32_storelps.
It didn't change in this patch, so it has presumably been correct for longer.
If you can put up with the amount of casting required to use it for the low
double of a __m128d (perhaps in a wrapper function that takes a void* and a
vector), _mm_storel_pi might be your best bet, unless there's anything weird
about the GCC internals for __builtin_ia32_storelps.
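
Something like the following is what I have in mind for the wrapper.  It's
only a sketch: store_low_double_u and demo_store_low_double_u are
hypothetical names, not anything from GCC's headers, and I'm assuming
__builtin_ia32_storelps really does tolerate a misaligned address.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h for _mm_storel_pi */
#include <string.h>     /* memcpy */

/* Hypothetical wrapper: store the low 64 bits (the low double) of v to
   a possibly-unaligned address.  _mm_storel_pi takes (__m64 *, __m128),
   hence the pointer cast and the free _mm_castpd_ps reinterpretation. */
static inline void store_low_double_u(void *dst, __m128d v)
{
    _mm_storel_pi((__m64 *)dst, _mm_castpd_ps(v));
}

/* Store at a deliberately misaligned offset in a byte buffer and read
   the double back with memcpy. */
static double demo_store_low_double_u(void)
{
    unsigned char buf[16] = {0};
    double out;

    store_low_double_u(buf + 3, _mm_set_pd(2.0, 1.0));  /* low = 1.0 */
    memcpy(&out, buf + 3, sizeof out);
    return out;
}
```

Callers then never see the __m64 / __m128 casts, and the destination can be
any byte pointer into the stream.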

The asm instruction you want is MOVLPS (1 byte shorter than the others in
non-AVX code), so _mm_storel_pi also has the advantage of hinting GCC to use
that.
