https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63341
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Another testcase:

typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{0x5010}, {0x5111}, {0x5212}, {0x5313}, {0x5414}, {0x5515},
                  {0x5616}, {0x5717}, {0x5818}, {0x5919}, {0x5a1a}, {0x5b1b},
                  {0x5c1c}, {0x5d1d}, {0x5e1e}, {0x5f1f}, {0x6020}, {0x6121},
                  {0x6222}, {0x6323}, {0x6424}, {0x6525}, {0x6626}, {0x6727},
                  {0x6828}, {0x6929}, {0x6a2a}, {0x6b2b}, {0x6c2c}, {0x6d2d},
                  {0x6e2e}, {0x6f2f}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s + 4;
  for (i = 0; i < 32; i++)
    if (d[i] != t.s[i].s + 4)
      __builtin_abort ();
    else
      asm volatile ("" : : : "memory");
  return 0;
}

which fails similarly.

For both testcases, if I manually change the addi 9,10,15 instruction to
addi 9,10,16, it passes.  That instruction corresponds to the statement
created in the *.vect dump:
  vectp_t.9_3 = &MEM[(void *)&t + 15B];
which I would change to
  vectp_t.9_3 = &MEM[(void *)&t + 16B];

Here is what the vectorizer emits for this second testcase:

  <bb 2>:
  vectp_t.5_19 = &MEM[(void *)&t + 1B];
  vectp_t.5_5 = vectp_t.5_19 & 4294967280B;
  vect__7.3_4 = MEM[(short unsigned int *)vectp_t.5_5];
  vect__7.6_2 = __builtin_altivec_mask_for_load (vectp_t.5_19);
  vectp_t.9_3 = &MEM[(void *)&t + 15B];
  vect_cst_.13_33 = { 4, 4, 4, 4, 4, 4, 4, 4 };
  vectp_d.15_35 = &d;

  <bb 3>:
  # i_24 = PHI <i_10(4), 0(2)>
  # ivtmp_21 = PHI <ivtmp_20(4), 32(2)>
  # vect__7.7_1 = PHI <vect__7.10_31(4), vect__7.3_4(2)>
  # vectp_t.8_28 = PHI <vectp_t.8_29(4), vectp_t.9_3(2)>
  # vectp_d.14_36 = PHI <vectp_d.14_37(4), vectp_d.15_35(2)>
  # ivtmp_9 = PHI <ivtmp_39(4), 0(2)>
  vectp_t.8_30 = vectp_t.8_28 & 4294967280B;
  vect__7.10_31 = MEM[(short unsigned int *)vectp_t.8_30];
  vect__7.11_32 = REALIGN_LOAD <vect__7.7_1, vect__7.10_31, vect__7.6_2>;
  _7 = t.s[i_24].s;
  vect__8.12_34 = vect__7.11_32 + vect_cst_.13_33;
  _8 = _7 + 4;
  MEM[(short unsigned int *)vectp_d.14_36] = vect__8.12_34;
  i_10 = i_24 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  vectp_t.8_29 = vectp_t.8_28 + 16;
  vectp_d.14_37 = vectp_d.14_36 + 16;
  ivtmp_39 = ivtmp_9 + 1;
  if (ivtmp_39 < 4)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

The SSA_NAMEs _1 and _31 aren't really used for anything except as arguments
to REALIGN_LOAD, so as long as the targets that support this (seemingly only
rs6000 and spu) handle the misaligned units fine (it is a question of whether
__builtin_altivec_mask_for_load computes the right mask for the permutations),
I'd say just fixing up the offset should be all that is needed.

Note that at least two of the three negative == true cases I saw on one x86_64
testcase use a negative offset and depend on it not having the low bits set
(so offset -7 must become -14 for V8HImode, not -15).  So, at least as a hack,
adding step - 1 to offset in vect_create_addr_base_for_vector_ref when offset
is non-NULL and positive might DTRT (because in all three negative == true
cases the offset will be negative).  But perhaps a cleaner approach would be a
bool flag saying that the offset is already in bytes, or something similar.
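
To make the arithmetic behind the +15B vs. +16B difference concrete, here is a
minimal standalone sketch (not GCC code; the 0x1000 base and the VECT_BYTES /
ELEM_BYTES constants are assumptions chosen to match the V8HImode case above).
The realign-load scheme advances the second load pointer by one vector minus
one element and relies on the 16-byte mask to snap it to the next aligned
chunk, which only works if the base is at least element-aligned:

  /* Standalone illustration, not GCC internals: with a byte-aligned base
     (&t + 1B in the testcase), base + 14 masked down to 16 bytes lands
     back in the first chunk, while adding ELEM_BYTES - 1 more (the
     step - 1 adjustment suggested above) reaches the second chunk.  */
  #include <stdio.h>
  #include <stdint.h>

  #define VECT_BYTES 16   /* V8HImode vector size in bytes */
  #define ELEM_BYTES 2    /* unsigned short element size */

  static uintptr_t
  chunk (uintptr_t addr)
  {
    return addr & ~(uintptr_t) (VECT_BYTES - 1);
  }

  int
  main (void)
  {
    uintptr_t base = 0x1000 + 1;                      /* like &t + 1B */
    uintptr_t cur = base + (VECT_BYTES - ELEM_BYTES); /* like &t + 15B */
    uintptr_t fix = cur + (ELEM_BYTES - 1);           /* like &t + 16B */

    printf ("current : %#lx masks to %#lx\n", (unsigned long) cur,
            (unsigned long) chunk (cur));
    printf ("proposed: %#lx masks to %#lx\n", (unsigned long) fix,
            (unsigned long) chunk (fix));
    return 0;
  }

For an element-aligned base the extra ELEM_BYTES - 1 makes no difference to
the masked address, so under these assumptions the adjustment only matters for
packed accesses like the ones in these testcases.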