https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69282
--- Comment #9 from Jim Wilson <wilson at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Jim Wilson from comment #7)
> > The simplified testcases fail on arm if you use -O3 -mfpu=neon.
> >
> > I can look at fixing the arm side of things if we need an md patch.
>
> Try my attached patch and see what the code generation is.

Looks like you changed options to -O2 -ftree-vectorize.  On the aarch64 side I see

        ldr     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 128
        cmeq    v0.4s, v0.4s, #0
        not     v0.16b, v0.16b
        cmlt    v0.4s, v0.4s, #0
        bit     v1.16b, v2.16b, v0.16b
        bic     v3.16b, v3.16b, v0.16b
        add     v2.4s, v2.4s, v4.4s

and on the arm side I see

        vld1.32 {q8}, [r3]
        adds    r3, r3, #16
        cmp     r2, r3
        vceq.i32        q8, q10, q8
        vbsl    q8, q10, q14
        vclt.s32        q8, q8, #0
        vbit    q9, q11, q8
        vbit    q12, q10, q8
        vadd.i32        q11, q11, q13

There is a vbsl instruction in the arm output, but still the same number of instructions with the apparently unnecessary second vector compare.
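
For the aarch64 sequence, the reason the second compare looks unnecessary can be sketched in scalar C. cmeq against #0 leaves each lane either all-ones or all-zeros, not keeps that property, and cmlt against #0 on such a lane just reproduces it. The helper names below are hypothetical and model a single 32-bit lane; this is an illustration under those assumptions, not the testcase from this bug, and it says nothing about the arm sequence, where the vbsl result need not be a mask.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane. All names here are hypothetical. */

/* cmeq vN.4s, vN.4s, #0 : all-ones if the lane equals zero, else zero. */
static int32_t lane_cmeq_zero(int32_t x) { return x == 0 ? -1 : 0; }

/* not vN.16b, vN.16b : bitwise inversion of the lane. */
static int32_t lane_not(int32_t m) { return ~m; }

/* cmlt vN.4s, vN.4s, #0 : all-ones if the lane is negative, else zero. */
static int32_t lane_cmlt_zero(int32_t m) { return m < 0 ? -1 : 0; }

/* Returns 1 if the trailing cmlt leaves the mask unchanged for input x. */
static int second_compare_is_redundant(int32_t x)
{
    int32_t mask = lane_not(lane_cmeq_zero(x));  /* always 0 or -1 */
    return lane_cmlt_zero(mask) == mask;
}
```

Since the mask entering the cmlt is always 0 or -1, second_compare_is_redundant() holds for every input lane value in this model.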