https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
We do better now:

        ldp     s1, s3, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s2, [x2, 4]
        ins     v1.s[1], v3.s[0]
        ld1     {v1.s}[2], [x2]
        ins     v1.s[3], v2.s[0]
        fdiv    v1.4s, v1.4s, v0.4s
        str     q1, [x0]

  _19 = {t_12(D), t_12(D), t_12(D), t_12(D)};
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _20 = {_1, _3, _5, _7};
  vect__2.3_18 = _20 / _19;
  MEM <vector(4) float> [(float *)a_11(D)] = vect__2.3_18;

But we still don't do the merging of the loads.
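
For readers without the testcase at hand, the GIMPLE above is consistent with a function of roughly the following shape. This is a sketch reconstructed from the dump, not the testcase attached to the bug: the function name `f`, the parameter order and the `restrict` qualifiers are assumptions.

```c
/* Hypothetical reconstruction from the GIMPLE dump: two adjacent floats
   from b and two adjacent floats from c are each divided by t and stored
   to four adjacent slots of a.  The name f and the restrict qualifiers
   are assumptions, not taken from the bug report.  */
void
f (float *restrict a, float *restrict b, float *restrict c, float t)
{
  a[0] = b[0] / t;   /* _1 = *b                     */
  a[1] = b[1] / t;   /* _3 = MEM[(float *)b + 4B]   */
  a[2] = c[0] / t;   /* _5 = *c                     */
  a[3] = c[1] / t;   /* _7 = MEM[(float *)c + 4B]   */
}
```

In the dump, `_20` is built from four scalar loads. Merging the loads would presumably mean reading `b[0..1]` and `c[0..1]` as two 64-bit pairs and combining those, instead of inserting each scalar into its lane.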