For the case below, the code generated by “gcc -O3” is very ugly.


char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for( int y = 0; y < 128; y++ )
    {
        for( int x = 0; x < 16; x++ )
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}



If we change “for( int x = 0; x < 16; x++ )” to
“for( int x = 0; x < 32; x++ )”, much cleaner vectorized code is generated:



test_loop:
.LFB0:
        .cfi_startproc
        adrp    x2, g_s1
        adrp    x3, g_s2
        add     x2, x2, :lo12:g_s1
        add     x3, x3, :lo12:g_s2
        adrp    x0, g_d
        adrp    x1, g_d+2048
        add     x0, x0, :lo12:g_d
        add     x1, x1, :lo12:g_d+2048
        ldp     q1, q2, [x2]
        ldp     q3, q0, [x3]
        add     v1.16b, v1.16b, v3.16b
        add     v0.16b, v0.16b, v2.16b
        .p2align 3,,7
.L2:
        str     q1, [x0]
        str     q0, [x0, 16]!
        cmp     x0, x1
        bne     .L2
        ret


The code generated for “for( int x = 0; x < 8; x++ )” is also very ugly.
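
To compare the 8, 16 and 32 cases from one source file, a minimal sketch
like the following may help. The INNER and OUTER macros and the
run_check() helper are hypothetical names introduced here, not part of
the original test case. The sketch also enlarges g_d, on the assumption
that every store should stay inside the array (the outer loop advances d
by 16 bytes for 128 iterations), so the generated code may differ
slightly from the listing above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical knob, not in the original test case: inner trip count.
   Build with e.g. -DINNER=8, -DINNER=16 or -DINNER=32. */
#ifndef INNER
#define INNER 16
#endif

#define OUTER 128

/* Assumption: g_d is sized so that all stores stay in bounds. */
char g_d[OUTER * 16 + INNER], g_s1[1024], g_s2[1024];

void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for (int y = 0; y < OUTER; y++) {
        for (int x = 0; x < INNER; x++)
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}

/* Fill the inputs, run test_loop(), and count bytes of g_d that differ
   from a per-byte scalar recomputation. Returns 0 if everything matches. */
int run_check(void)
{
    int limit = (OUTER - 1) * 16 + INNER;  /* one past the last byte written */
    int bad = 0;

    assert(INNER >= 1 && INNER <= 1024);   /* reads of g_s1/g_s2 stay in bounds */

    for (int i = 0; i < 1024; i++) {
        g_s1[i] = (char)i;
        g_s2[i] = (char)(3 * i + 1);
    }
    memset(g_d, 0, sizeof g_d);

    test_loop();

    for (int i = 0; i < limit; i++) {
        /* The last row that touches byte i is row i/16, capped at the final row. */
        int y = i / 16;
        if (y > OUTER - 1)
            y = OUTER - 1;
        int x = i - y * 16;
        /* Bytes with x >= INNER (only possible when INNER < 16) are never written. */
        char expect = (x < INNER) ? (char)(g_s1[x] + g_s2[x]) : 0;
        if (g_d[i] != expect)
            bad++;
    }
    return bad;
}
```

Compiling this with, for example, "gcc -O3 -S -DINNER=32 test.c" (test.c
being a placeholder file name) and again with -DINNER=16 and -DINNER=8
should give assembly for the three variants that can be diffed directly.
Calling run_check() from a small main() is one way to confirm that each
variant still computes the same result.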


It looks like gcc has potential bugs in loop vectorization. Any ideas?



Thanks,

-Jiangning
