https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68532

cbaylis at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2015-11-25
                 CC|                            |cbaylis at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |cbaylis at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from cbaylis at gcc dot gnu.org ---

This is a simplified test case which shows elements being read incorrectly
from the vector loads done in the vectorized loop.

#include <stdio.h>
#include <stdalign.h>

#define SIZE 128
unsigned short alignas(16) in[SIZE];
unsigned short alignas(16) out[SIZE];

__attribute__ ((noinline)) void
test (unsigned short *restrict out, unsigned short *restrict in)
{
    int i = 0;
    /* Gather every 8th element of in[] into out[].  */
    for (int j = 0; j < SIZE; j += 8)
      out[i++] = in[j];
}

int main()
{
    for (int i = 0; i < SIZE; i++) in[i] = i;
    test(out, in);
    for (int i = 0; i < SIZE/8; i++)
      printf("result: %d, expect %d\n", out[i], i*8);
    return 0;
}

Results with armeb-unknown-linux-gnueabihf:
result: 36, expect 0    <- incorrect results from vectorized loop...
result: 44, expect 8
result: 52, expect 16
result: 60, expect 24
result: 4, expect 32
result: 12, expect 40
result: 20, expect 48
result: 28, expect 56
result: 64, expect 64    <- correct results come from fixup loop
result: 72, expect 72
result: 80, expect 80
result: 88, expect 88
result: 96, expect 96
result: 104, expect 104
result: 112, expect 112
result: 120, expect 120

The vectorized loop gives incorrect results because it is compiled into a
sequence of vector loads followed by VUZP operations to extract every 8th
element.

Looking at the generated assembly for test():
$ armeb-unknown-linux-gnueabihf-gcc -O1 -ftree-vectorize -fno-vect-cost-model
-mfpu=neon tvuzp2.c -S

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        orr     r2, r1, r0
        tst     r2, #15
        bne     .L2
        vld1.64 {d16-d17}, [r1:64]
        vldr    d18, [r1, #16]
        vldr    d19, [r1, #24]
        vuzp.16 q9, q8
        vldr    d18, [r1, #32]
        vldr    d19, [r1, #40]
        vldr    d20, [r1, #48]
        vldr    d21, [r1, #56]
        vuzp.16 q10, q9
        vuzp.16 q9, q8
        vldr    d18, [r1, #64]
        vldr    d19, [r1, #72]
        vldr    d20, [r1, #80]
        vldr    d21, [r1, #88]
        vuzp.16 q10, q9
        vldr    d20, [r1, #96]
        vldr    d21, [r1, #104]
        vldr    d22, [r1, #112]
        vldr    d23, [r1, #120]
        vuzp.16 q11, q10
        vuzp.16 q10, q9
        vuzp.16 q9, q8
        vst1.64 {d16-d17}, [r0:64]

We see that the 128-bit vectors are loaded either with vld1.64 or, where the
addressing mode needs an immediate offset, with a pair of vldr instructions.
This means the vectors are loaded thus:

layout of input array in memory:
<in>   :  0       1       2       3       4       5       6       7
<in+16>:  8       9       10      11      12      13      14      15
<in+32>:  16      17      18      19      20      21      22      23
<in+48>:  24      25      26      27      28      29      30      31

layout of q8 after vld1.64 {d16-d17}, [r1:64] (showing architectural lanes 0 to
7, in ascending order)
    q8 = { 3, 2, 1, 0, 7, 6, 5, 4 }

So the first 64 bits are loaded into lanes 0-3, but in big-endian order, and
the second 64 bits are loaded into lanes 4-7, also in big-endian order.
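To make the pattern concrete, here is a small C model (my sketch, not GCC
code; the function name is made up) of where memory element m ends up: the
vector is loaded as two 64-bit chunks, and reversing the four 16-bit elements
within each chunk is an XOR of the element index with 3.

/* Sketch: architectural lane receiving memory element m, for a
   128-bit vector of 16-bit elements loaded the way the back end
   does it in big-endian mode (two 64-bit chunks, elements
   reversed within each chunk).  */
static int
arch_lane_of_mem_element (int m)
{
  /* 4 elements per 64-bit chunk; reversing within a group of 4 is
     an XOR with 3: 0->3, 1->2, 2->1, 3->0, 4->7, ..., 7->4.  */
  return m ^ 3;
}

For m = 0..7 this gives lanes 3, 2, 1, 0, 7, 6, 5, 4, which is consistent
with the q8 layout shown above (the mapping is its own inverse).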

The vectorizer expects the result of loading a 128-bit vector from memory to
be:
    q8 = { 7, 6, 5, 4, 3, 2, 1, 0 }
but the back end does not do that, in an attempt to support the EABI memory
ordering for 128-bit vectors. The behaviour of 128-bit vector loads/stores is
defined by output_move_neon(). There is a large comment in arm.c above
output_move_neon() ("WARNING: The ordering of elements is weird in big-endian
mode...") which explains this.

The ARM back end already uses a strategy of representing lane numbers in a
"GCC numbering" and transforming them to architectural numbering when
generating assembly. This transformation is currently a simple reversal in
big-endian mode, and a 1:1 map in little-endian mode.

I think this bug could be fixed by using a lane-number mapping which takes
the behaviour of output_move_neon() into account. This would prevent the
vectorizer from using VUZP for the VEC_PERM_EXPR in big-endian mode, because
the lanes specified would not match the behaviour of VUZP on 128-bit vectors.
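A rough sketch of the kind of mapping I mean (hypothetical code, assuming the
reversal should happen within each 64-bit chunk; elt_bits is the element
width in bits):

/* Sketch of a lane mapping which matches what output_move_neon()
   actually does with 128-bit vectors in big-endian mode: reverse
   lanes within each 64-bit (D register) chunk rather than across
   the whole vector.  */
static int
arch_lane_proposed (int gcc_lane, int elt_bits, int big_endian)
{
  int per_chunk = 64 / elt_bits;   /* 8, 4 or 2 elements per chunk */
  return big_endian ? gcc_lane ^ (per_chunk - 1) : gcc_lane;
}

This mirrors the per-D-register element reversal performed by the loads, so
the lane selectors the vectorizer would need for "every 8th element" no
longer look like a 128-bit VUZP, and the VEC_PERM_EXPR would have to be
expanded some other way in big-endian mode.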
