https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013
--- Comment #9 from krux <hoganmeier at gmail dot com> ---
(In reply to ktkachov from comment #7)
> I tried current trunk (future GCC 9)
> GCC 9 learned to avoid excessive widening during vectorisation, which is
> what accounts for the large number of instructions you see.

Confirmed, the loop is now as described in comment #5 with trunk gcc.
Still with vshr+vmovn as mentioned by Ramana.

But by the way, the tail is completely unrolled, 15x the following, seems quite
excessive to me:

        ldrb    ip, [r1, #1]    @ zero_extendqisi2
        movs    r6, #151
        ldrb    lr, [r1]        @ zero_extendqisi2
        movs    r5, #77
        ldrb    r7, [r1, #2]    @ zero_extendqisi2
        movs    r4, #28
        smulbb  ip, ip, r6
        smlabb  lr, r5, lr, ip
        add     ip, r3, #1
        smlabb  r7, r4, r7, lr
        cmp     ip, r2
        asr     r7, r7, #8
        strb    r7, [r0]
        bge     .L1

assert(n >= 16) helps a bit, but n % 16 == 0 doesn't.
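
For readers following along, here is a rough C sketch of what one iteration of that tail computes, read off the assembly: three consecutive input bytes weighted by 77, 151 and 28, summed, shifted right by 8 and stored as one byte. This is not the reporter's test case. The names gray_px and to_gray are made up for illustration, the 3-byte input stride is an assumption (the stride is not visible in the quoted tail), and writing the two hints with the standard assert macro is only one plausible reading of the last sentence.

```c
#include <assert.h>
#include <stdint.h>

/* One scalar (tail) iteration as read off the assembly:
   px[0]*77 + px[1]*151 + px[2]*28, then >> 8, stored as a byte.
   The weights sum to 256, so the result always fits in 8 bits. */
static inline uint8_t gray_px(const uint8_t *px)
{
    return (uint8_t)((77 * px[0] + 151 * px[1] + 28 * px[2]) >> 8);
}

/* ASSUMPTION: 3 input bytes per output byte; hypothetical function name. */
void to_gray(uint8_t *out, const uint8_t *in, int n)
{
    /* The two hints mentioned in the comment, stated with assert().
       When NDEBUG is not defined, a failing assert does not return,
       so the compiler may assume both conditions hold in the loop. */
    assert(n >= 16);
    assert(n % 16 == 0);

    for (int i = 0; i < n; i++)
        out[i] = gray_px(in + 3 * i);
}
```

Compiling a function like this with and without each assert is one way to compare how much scalar tail code is emitted. Flags and target are left out here because the quoted comment does not give them.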