Keeping alignment when slicing is easy, since the required alignment matches the size of the xmm registers (16 bytes): one only has to partition the array into blocks of 2 doubles, 4 floats, etc. For AVX, the ideal alignment is on 32-byte boundaries, but the really bad performance hit happens only when an unaligned access crosses a cache-line boundary. With SSE2, by contrast, the penalty applies to every unaligned access.
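
To make the partitioning concrete, here is a minimal sketch in plain Python. The names aligned_chunks and crosses_cache_line are hypothetical helpers made up for illustration, not part of any library. It assumes the base address of the array is itself aligned, and it assumes a 64-byte cache line, which is typical for current x86 CPUs but is not something stated above.

```python
def aligned_chunks(n_items, itemsize, n_chunks, alignment=16):
    """Split range(n_items) into contiguous (start, stop) pairs.

    Every start index is a multiple of alignment // itemsize, so each
    chunk begins on an aligned boundary provided the array base is
    aligned (an assumption of this sketch).
    """
    if itemsize <= 0 or alignment % itemsize != 0:
        raise ValueError("alignment must be a multiple of itemsize")
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    block = alignment // itemsize          # 2 doubles or 4 floats for 16 bytes
    n_blocks = -(-n_items // block)        # ceiling division
    per_chunk = max(1, -(-n_blocks // n_chunks))
    step = per_chunk * block               # chunk length in items
    return [(start, min(start + step, n_items))
            for start in range(0, n_items, step)]


def crosses_cache_line(offset, size, line=64):
    """True if the bytes [offset, offset + size) span two cache lines.

    The 64-byte line size is an assumption, not a universal constant.
    """
    return offset // line != (offset + size - 1) // line


# 10 doubles (8 bytes each) split for 3 workers on 16-byte boundaries.
print(aligned_chunks(10, 8, 3))
# A 32-byte AVX load at byte offset 16 stays inside one cache line,
# while the same load at byte offset 48 straddles two.
print(crosses_cache_line(16, 32), crosses_cache_line(48, 32))
```

The last chunk may be shorter than the others and may end on a partial block; only the chunk starts are guaranteed to be aligned.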
