https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70130
--- Comment #1 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
It's not clear to me from the report whether you have run this only on big-endian systems, or whether you have also tried little-endian on Power8 (with -mcpu=power8). Can you please clarify?

I ask because -mcpu=power7 causes the versioned loop to use __builtin_altivec_mask_for_load to do the lvx/lvx/lvsl/vperm trick, whereas with -mcpu=power8 we would just have done unaligned loads. If the behavior with -mcpu=power8 differs between BE and LE, that might be a clue to a back end problem.
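
For readers unfamiliar with the lvx/lvx/lvsl/vperm trick mentioned above, here is a rough portable C sketch that emulates the sequence with plain byte arrays. It is not the code GCC generates: the names vec16, emu_lvx, emu_lvsl, emu_vperm, emu_unaligned_load and emu_selfcheck are hypothetical and exist only for illustration, the second load address of p + 15 is an assumption about how the sequence is usually written, and only big-endian element numbering is modelled (little-endian code generation differs and is not shown).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte value standing in for an AltiVec register. */
typedef struct { uint8_t b[16]; } vec16;

/* Emulates lvx: the low four address bits are ignored, so the load
   always reads the enclosing 16-byte-aligned block. */
vec16 emu_lvx(const uint8_t *p) {
    vec16 v;
    const uint8_t *blk = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    memcpy(v.b, blk, 16);
    return v;
}

/* Emulates lvsl: builds the permute control {sh, sh+1, ..., sh+15},
   where sh is the misalignment of the address (address & 15). */
vec16 emu_lvsl(const uint8_t *p) {
    vec16 m;
    unsigned sh = (unsigned)((uintptr_t)p & 15);
    for (int i = 0; i < 16; i++)
        m.b[i] = (uint8_t)(sh + i);
    return m;
}

/* Emulates vperm: the low five bits of each control byte select one
   byte from the 32-byte concatenation of a followed by b. */
vec16 emu_vperm(vec16 a, vec16 b, vec16 c) {
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = c.b[i] & 31u;
        r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Two aligned loads plus a permute yield the 16 bytes starting at p.
   Assumption: the second load uses p + 15, so an already aligned p
   never touches the following block. */
vec16 emu_unaligned_load(const uint8_t *p) {
    vec16 lo   = emu_lvx(p);
    vec16 hi   = emu_lvx(p + 15);
    vec16 mask = emu_lvsl(p);
    return emu_vperm(lo, hi, mask);
}

/* Checks every misalignment 0..16 against a known byte pattern. */
int emu_selfcheck(void) {
    _Alignas(16) uint8_t buf[48];
    for (int i = 0; i < 48; i++)
        buf[i] = (uint8_t)i;
    for (int off = 0; off <= 16; off++) {
        vec16 v = emu_unaligned_load(buf + off);
        for (int k = 0; k < 16; k++)
            assert(v.b[k] == off + k);
    }
    return 1;
}
```

Calling emu_selfcheck() exercises every offset within one 16-byte block. The correctness of the real sequence depends on the permute control matching the target's element order, which is why comparing BE and LE results with -mcpu=power8 (where this path is not used) can help narrow down where a problem lies.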