https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
I posted this as a question on Stack Overflow and got some useful comments (and
had some ideas of my own while writing up a mask-generation answer).

http://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m

Stephen Canon points out that VMASKMOVPS isn't actually useful: you can instead
use unaligned loads/stores for the peeled first/last iteration, and do
overlapping work.  You just have to make sure you load any data you need before
clobbering it.  I posted an answer using that idea, but I'm not sure if it's
the sort of thing a compiler could decide to use.
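To make the overlapping-unaligned idea concrete, here's a minimal sketch (not
the exact code from my answer; the function and names are made up, and it
assumes n >= 8).  The edge vectors are loaded before the aligned main loop
stores anything, so every overlapping store writes the same value:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: in-place a[i] *= scale for n >= 8 floats, no masking needed.
   The first/last vectors are loaded *before* the aligned loop clobbers
   anything, so redoing the overlapping elements is harmless. */
static void scale_inplace(float *a, size_t n, float scale)
{
    __m256 vscale = _mm256_set1_ps(scale);

    /* Compute the (possibly overlapping) edge vectors from the original data. */
    __m256 first = _mm256_mul_ps(_mm256_loadu_ps(a),         vscale);
    __m256 last  = _mm256_mul_ps(_mm256_loadu_ps(a + n - 8), vscale);

    /* Aligned main loop over the middle; any elements it shares with the
       edge vectors get the same result stored twice, which is harmless. */
    float *p   = (float *)(((uintptr_t)a + 31) & ~(uintptr_t)31);
    float *end = a + n - 8;
    for (; p < end; p += 8)
        _mm256_store_ps(p, _mm256_mul_ps(_mm256_load_ps(p), vscale));

    _mm256_storeu_ps(a,         first);
    _mm256_storeu_ps(a + n - 8, last);
}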


For reduction loops where we need to accumulate each element exactly once, a
mask is still useful, but we can use it for ANDPS / ANDNPS instead of VMASKMOV.
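Roughly like this (a sketch only, assuming the bytes between the aligned
boundary and the true start are readable, e.g. part of the same allocation,
and that the range is at least a couple of vectors long; tail handling and the
hypothetical head_mask argument are left to the caller).  ANDNPS works the
same way if the mask comes out inverted:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: sum of a float range whose start isn't 32B-aligned.  The first
   vector is an aligned load that also covers a few lanes *before* `start`;
   ANDPS with `head_mask` (zero in those lanes, all-ones otherwise) removes
   them from the sum, so no VMASKMOVPS load is needed. */
static float sum_head_masked(const float *start, size_t n, __m256 head_mask)
{
    const float *p   = (const float *)((uintptr_t)start & ~(uintptr_t)31);
    const float *end = start + n;

    __m256 sum = _mm256_and_ps(_mm256_load_ps(p), head_mask);
    for (p += 8; p + 8 <= end; p += 8)
        sum = _mm256_add_ps(sum, _mm256_load_ps(p));
    /* (elements past the last full vector are not handled in this sketch) */

    /* horizontal sum of the 8 lanes */
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(sum),
                           _mm256_extractf128_ps(sum, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}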

I improved the mask-generation to a single AVX2 VPMOVSXBD load (with 5 or 7
single-uop integer instructions to generate the index from the start/end
address).  VPCMPGT isn't needed: instead just use an index to take the right
window of bytes from memory.  This emulates a variable-count VPSLLDQ on a
buffer of all-ones.
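The core of the sliding-window trick looks roughly like this (a simplified
sketch, not the exact index math from the answer; the table layout and helper
name are made up).  The resulting vector can be reinterpreted as __m256 and
fed to the ANDPS above:

#include <immintrin.h>
#include <stdint.h>

/* 8 bytes of -1 followed by 8 bytes of 0.  Loading 8 bytes at offset
   (8 - count) and sign-extending each byte to a dword gives a vector whose
   first `count` lanes are all-ones.  The real index comes from the start/end
   addresses with a handful of integer instructions, as described above. */
static const int8_t mask_window[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1,  0, 0, 0, 0, 0, 0, 0, 0
};

/* count in [0, 8]: number of leading lanes to keep. */
static __m256i head_mask(unsigned count)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)(mask_window + 8 - count));
    /* compilers can fold the load into a single vpmovsxbd ymm, m64 */
    return _mm256_cvtepi8_epi32(bytes);
}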

This is something gcc could maybe use, but some experimental testing comparing
it against simply using overlapping unaligned loads/stores is probably
warranted before spending any time implementing automatic generation of
something this complicated.
