https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #9 from krux <hoganmeier at gmail dot com> ---
(In reply to ktkachov from comment #7)
> I tried current trunk (future GCC 9)
> GCC 9 learned to avoid excessive widening during vectorisation, which is
> what accounts for the large number of instructions you see.

Confirmed: with trunk GCC the loop is now as described in comment #5,
though still with vshr+vmovn as mentioned by Ramana.

By the way, the scalar tail is completely unrolled: the following sequence is
repeated 15 times, which seems quite excessive to me:

        ldrb    ip, [r1, #1]    @ zero_extendqisi2
        movs    r6, #151
        ldrb    lr, [r1]        @ zero_extendqisi2
        movs    r5, #77
        ldrb    r7, [r1, #2]    @ zero_extendqisi2
        movs    r4, #28
        smulbb  ip, ip, r6
        smlabb  lr, r5, lr, ip
        add     ip, r3, #1
        smlabb  r7, r4, r7, lr
        cmp     ip, r2
        asr     r7, r7, #8
        strb    r7, [r0]
        bge     .L1

Adding assert(n >= 16) helps a bit, but telling the compiler that
n % 16 == 0 doesn't.
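
For reference, a rough sketch of the kind of kernel and hints in question. This is a reconstruction inferred only from the constants 77/151/28 and the shift by 8 in the assembly above, not the testcase attached to the bug. The function name rgb_to_gray, the parameter names, and the use of __builtin_unreachable() to express n % 16 == 0 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction: one weighted sum per 3-byte input pixel,
   matching the three ldrb loads, smulbb/smlabb chain, asr #8 and strb
   seen in the unrolled tail. */
void rgb_to_gray(uint8_t *restrict dst, const uint8_t *restrict src, int n)
{
    /* Hint 1: at least one full 16-element vector iteration
       (only visible to the compiler when NDEBUG is not defined). */
    assert(n >= 16);

    /* Hint 2 (assumed spelling): n is a multiple of 16, so in principle
       no scalar tail is needed. */
    if (n % 16 != 0)
        __builtin_unreachable();

    for (int i = 0; i < n; i++) {
        int r = src[3 * i];
        int g = src[3 * i + 1];
        int b = src[3 * i + 2];
        dst[i] = (uint8_t)((77 * r + 151 * g + 28 * b) >> 8);
    }
}
```

Compiling something like this with -O3 for an ARM NEON target and inspecting the output with -S should show whether the unrolled tail is still emitted; the exact result will depend on the GCC revision and flags.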
