https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68532

cbaylis at gcc dot gnu.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2015-11-25
                 CC|                              |cbaylis at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |cbaylis at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from cbaylis at gcc dot gnu.org ---
This is a simplified test case which shows how elements are read
incorrectly from the vector loads in the vectorized loop.
(Note: "int i;" needs to be "int i = 0;", otherwise the index is used
uninitialized.)

#include <stdio.h>
#include <stdalign.h>

#define SIZE 128

unsigned short alignas(16) in[SIZE];
unsigned short alignas(16) out[SIZE];

__attribute__ ((noinline)) void
test (unsigned short *restrict out, unsigned short *restrict in)
{
  int i = 0;
  for (int j = 0; j < SIZE; j += 8)
    out[i++] = in[j];
}

int
main ()
{
  for (int i = 0; i < SIZE; i++)
    in[i] = i;
  test (out, in);
  for (int i = 0; i < SIZE / 8; i++)
    printf ("result: %d, expect %d\n", out[i], i * 8);
  return 0;
}

Results with armeb-unknown-linux-gnueabihf:

result: 36, expect 0      <- incorrect results from vectorized loop...
result: 44, expect 8
result: 52, expect 16
result: 60, expect 24
result: 4, expect 32
result: 12, expect 40
result: 20, expect 48
result: 28, expect 56
result: 64, expect 64     <- correct results come from fixup loop
result: 72, expect 72
result: 80, expect 80
result: 88, expect 88
result: 96, expect 96
result: 104, expect 104
result: 112, expect 112
result: 120, expect 120

The vectorized loop gives incorrect results because it is compiled into
a sequence of vector loads with VUZP operations to extract every 8th
element.

Looking at the generated assembly for test():

$ armeb-unknown-linux-gnueabihf-gcc -O1 -ftree-vectorize \
    -fno-vect-cost-model -mfpu=neon tvuzp2.c -S

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        orr     r2, r1, r0
        tst     r2, #15
        bne     .L2
        vld1.64 {d16-d17}, [r1:64]
        vldr    d18, [r1, #16]
        vldr    d19, [r1, #24]
        vuzp.16 q9, q8
        vldr    d18, [r1, #32]
        vldr    d19, [r1, #40]
        vldr    d20, [r1, #48]
        vldr    d21, [r1, #56]
        vuzp.16 q10, q9
        vuzp.16 q9, q8
        vldr    d18, [r1, #64]
        vldr    d19, [r1, #72]
        vldr    d20, [r1, #80]
        vldr    d21, [r1, #88]
        vuzp.16 q10, q9
        vldr    d20, [r1, #96]
        vldr    d21, [r1, #104]
        vldr    d22, [r1, #112]
        vldr    d23, [r1, #120]
        vuzp.16 q11, q10
        vuzp.16 q10, q9
        vuzp.16 q9, q8
        vst1.64 {d16-d17}, [r0:64]

We see that the 128-bit vectors are loaded either with vld1.64, or
with a pair of vldr instructions when the addressing mode requires an
offset. This means the vectors are loaded thus:

layout of input array in memory:
<in>   :  0  1  2  3  4  5  6  7
<in+16>:  8  9 10 11 12 13 14 15
<in+32>: 16 17 18 19 20 21 22 23
<in+48>: 24 25 26 27 28 29 30 31

layout of q8 after vld1.64 {d16-d17}, [r1:64]
(showing architectural lanes 0 to 7, in ascending order)
  q8 = { 3, 2, 1, 0, 7, 6, 5, 4 }

So the first 64 bits are loaded into lanes 0-3, but in big-endian
order, and the second 64 bits are loaded into lanes 4-7, also in
big-endian order.

The vectorizer expects the result of loading a 128-bit vector from
memory to be:
  q8 = { 7, 6, 5, 4, 3, 2, 1, 0 }
but the back end does not do that, in an attempt to support the EABI
memory ordering for 128-bit vectors. The behaviour of 128-bit vector
loads/stores is defined by output_move_neon(). There is a large
comment in arm.c above output_move_neon() ("WARNING: The ordering of
elements is weird in big-endian mode...") which explains this.
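To make the mismatch concrete, here is a small self-contained C
program (illustrative only, not part of the test case) which models
the two lane orderings. The half-reversal formula is my reading of
the q8 layout shown above:

#include <stdio.h>

#define NELEMS 8            /* 16-bit elements in a 128-bit q register */
#define HALF   (NELEMS / 2) /* elements per 64-bit d register */

/* Lane the vectorizer expects memory element E to occupy in
   big-endian mode: a full reversal of the 128-bit vector.  */
static int expected_lane (int e) { return NELEMS - 1 - e; }

/* Lane memory element E actually occupies after the vld1.64/vldr
   sequence: each 64-bit half is reversed independently.  */
static int actual_lane (int e)
{
  return (e & ~(HALF - 1)) | (HALF - 1 - (e & (HALF - 1)));
}

int main (void)
{
  for (int e = 0; e < NELEMS; e++)
    printf ("element %d: expected lane %d, actual lane %d\n",
            e, expected_lane (e), actual_lane (e));
  return 0;
}

For element 0 this prints expected lane 7 but actual lane 3, matching
the q8 = { 3, 2, 1, 0, 7, 6, 5, 4 } layout above. The two orderings
coincide for a 64-bit vector (a single d register), which is why the
problem is specific to 128-bit vectors.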
The ARM back end already uses a strategy of representing lane numbers
in a "GCC numbering" and transforming them to architectural numbering
when generating assembly. This is currently a simple reversal in
big-endian mode, and a 1:1 map in little-endian mode. I think this bug
could be fixed by using a lane-number mapping which takes the
behaviour of output_move_neon() into account. This would prevent the
vectorizer from using VUZP for the VEC_PERM_EXPR in big-endian mode,
because the lanes specified would not match the behaviour of VUZP on
128-bit vectors.
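As a sketch of the idea (the helper and its signature are hypothetical
and illustrative, not actual arm.c code), the remapping could look
something like this:

#include <stdio.h>

/* Hypothetical sketch only; gcc_lane_to_arch_lane is not an existing
   arm.c function.  'nelems' is the number of elements in the vector,
   'quad' is nonzero for a 128-bit q register.  The mapping is its own
   inverse, so it serves for both directions.  */
static int
gcc_lane_to_arch_lane (int lane, int nelems, int quad, int big_endian)
{
  if (!big_endian)
    return lane;                    /* little-endian: 1:1 map */
  if (!quad)
    return nelems - 1 - lane;       /* d registers: simple reversal */
  /* q registers: reverse within each 64-bit half, matching what
     output_move_neon() actually does with the data.  */
  int half = nelems / 2;
  return (lane & ~(half - 1)) | (half - 1 - (lane & (half - 1)));
}

int main (void)
{
  /* 8 x 16-bit elements in a q register, big-endian mode.  */
  for (int lane = 0; lane < 8; lane++)
    printf ("gcc lane %d -> arch lane %d\n",
            lane, gcc_lane_to_arch_lane (lane, 8, 1, 1));
  return 0;
}

With such a mapping, the lane selection the vectorizer asks for in the
VEC_PERM_EXPR would no longer line up with VUZP's architectural
behaviour on q registers, so the permute would not be matched to VUZP
in big-endian mode.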