https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504

            Bug ID: 86504
           Summary: vectorization failure for a nested loop
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

Created attachment 44386
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44386&action=edit
bad vectorization result for boundary size 16

For the case below, the code generated by “gcc -O3” is very ugly, even though
the inner loop could be vectorized correctly. Please refer to the attached file
test_loop_inner_16.s.

char g_d[1024], g_s1[1024], g_s2[1024];
void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for ( int y = 0; y < 128; y++ )
    {
        for ( int x = 0; x < 16; x++ )
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}

If we change the inner loop “for ( int x = 0; x < 16; x++ )” to “for ( int
x = 0; x < 32; x++ )”, i.e. the loop bound changes from 16 to 32, very clean
vectorized code is generated. For example, the code below is the aarch64
result for a loop bound of 32, and the same holds for x86.

test_loop:
.LFB0:
        .cfi_startproc
        adrp    x2, g_s1
        adrp    x3, g_s2
        add     x2, x2, :lo12:g_s1
        add     x3, x3, :lo12:g_s2
        adrp    x0, g_d
        adrp    x1, g_d+2048
        add     x0, x0, :lo12:g_d
        add     x1, x1, :lo12:g_d+2048
        ldp     q1, q2, [x2]
        ldp     q3, q0, [x3]
        add     v1.16b, v1.16b, v3.16b
        add     v0.16b, v0.16b, v2.16b
        .p2align 3,,7
.L2:
        str     q1, [x0]
        str     q0, [x0, 16]!
        cmp     x0, x1
        bne     .L2
        ret

The code generated for a loop bound of 8 is also very bad.
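
To make the three bounds (8, 16, 32) easier to compare, here is a minimal
sketch of the same test case with the inner bound as a macro. INNER_BOUND and
ROWS are names assumed here for illustration and are not part of the original
test case. The destination array is also sized so that all 128 rows fit, which
differs from the 1024-byte arrays above.

```c
/* Sketch only: INNER_BOUND and ROWS are hypothetical names added for
   illustration; the original test case hard-codes 16 and 128. */
#ifndef INNER_BOUND
#define INNER_BOUND 16
#endif
#define ROWS 128

/* Assumption: g_d is enlarged so every store stays in bounds
   (the last row starts at (ROWS - 1) * 16 and is INNER_BOUND bytes long). */
char g_d[ROWS * 16 + INNER_BOUND], g_s1[1024], g_s2[1024];

void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for ( int y = 0; y < ROWS; y++ )
    {
        /* Only the trip count of this loop changes between the variants. */
        for ( int x = 0; x < INNER_BOUND; x++ )
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}
```

Building it with, for example, “gcc -O3 -S -DINNER_BOUND=8”, “-DINNER_BOUND=16”
and “-DINNER_BOUND=32” should give one assembly file per bound, and adding
“-fopt-info-vec” should report which loops the vectorizer handled.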

Any idea?
