https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For the v4sf issue (v2sf has a similar issue too):

(define_insn "vec_extract<mode>"
  [(set (match_operand:<V_elem> 0 "nonimmediate_operand" "=Um,r")
        (vec_select:<V_elem>
          (match_operand:VQ2 1 "s_register_operand" "w,w")
          (parallel [(match_operand:SI 2 "immediate_operand" "i,i")])))]

That does not allow the w (neon/vfp register) constraint as a destination, so everything needs to go through GPRs.

There is no vec_extract for V2DF, which causes it to go through memory.

For the init part, the biggest issue is that neon_expand_vector_init falls back to doing everything in memory instead of doing insertions when there are a small number of elements (<= 4, though you could do some gpr logical operations to get to that number if needed):

  /* Construct the vector in memory one field at a time
     and load the whole vector.  */
  mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));
  for (i = 0; i < n_elts; i++)
    emit_move_insn (adjust_address_nv (mem, inner_mode,
                                       i * GET_MODE_SIZE (inner_mode)),
                    XVECEXP (vals, 0, i));
  emit_move_insn (target, mem);

----- CUT ----

Note aarch64_expand_vector_init has some interesting ideas that could be repeated here (and more due to the overlapping of lower d, q, and s registers).
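
For anyone who wants to look at the generated code, here is a minimal sketch of the kind of source that should reach the extract and init paths described above. It is not the testcase from the report; the typedefs and function names are made up for illustration and it only relies on the GCC vector extensions. Compiling it for an arm target with NEON enabled (something like -O2 -mfpu=neon -mfloat-abi=hard, which is my assumption about a suitable configuration) and reading the assembly should show whether the values travel through GPRs or a stack temporary.

```c
/* Hypothetical testcase sketch; names are illustrative only and
   are not taken from the bug report.  Uses GCC vector extensions.  */
typedef float v4sf __attribute__ ((vector_size (16)));
typedef double v2df __attribute__ ((vector_size (16)));

/* Single-lane extract from a V4SF value; expected to expand
   through the vec_extract pattern for that mode.  */
float
extract_v4sf (v4sf v)
{
  return v[1];
}

/* Single-lane extract from a V2DF value, the mode the comment
   says has no vec_extract pattern.  */
double
extract_v2df (v2df v)
{
  return v[1];
}

/* Build a 4-element vector from scalars; on arm this is the case
   that is expected to reach neon_expand_vector_init.  */
v4sf
init_v4sf (float a, float b, float c, float d)
{
  return (v4sf) { a, b, c, d };
}
```

The functions are kept separate and non-inlined on purpose so that each one isolates a single expansion in the assembly output.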