On Tue, Nov 29, 2022 at 6:40 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
> For the following test-case:
>
> int16x8_t foo(int16_t x, int16_t y)
> {
>   return (int16x8_t) { x, y, x, y, x, y, x, y };
> }
(Not to block this patch) It seems this trick can be applied even with
a less-than-perfect initializer, e.g.:

int16x8_t foo(int16_t x, int16_t y)
{
  return (int16x8_t) { x, y, x, y, x, y, x, 0 };
}

which should generate something like:

  dup v0.8h, w0
  dup v1.8h, w1
  zip1 v0.8h, v0.8h, v1.8h
  ins v0.h[7], wzr

Thanks,
Andrew Pinski

>
> Code gen at -O3:
> foo:
>         dup     v0.8h, w0
>         ins     v0.h[1], w1
>         ins     v0.h[3], w1
>         ins     v0.h[5], w1
>         ins     v0.h[7], w1
>         ret
>
> For 16 elements, it results in 8 ins instructions, which might not be
> optimal.
> I guess the above code-gen would be equivalent to the following?
>         dup v0.8h, w0
>         dup v1.8h, w1
>         zip1 v0.8h, v0.8h, v1.8h
>
> I have attached a patch to do the same if the number of elements is >= 8,
> which should possibly be better than the current code-gen?
> Patch passes bootstrap+test on aarch64-linux-gnu.
> Does the patch look OK?
>
> Thanks,
> Prathamesh
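To sanity-check why dup + dup + zip1 reproduces the { x, y, x, y, ... } pattern (and why one extra ins covers the trailing-zero variant), here is a rough scalar model in plain C. It is not NEON code. The helper names (dup_8h, zip1_8h, ins_h, build_xy, build_xy_zero_last, lanes_equal) are hypothetical, and each one only models the lane semantics of the matching A64 instruction on an 8 x 16-bit vector, assuming zip1 interleaves the low halves of its two sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* int16x8_t has eight 16-bit lanes */

/* Scalar stand-in for "dup vD.8h, wN": broadcast one value to every lane. */
static void dup_8h(int16_t out[LANES], int16_t value)
{
    for (int i = 0; i < LANES; i++)
        out[i] = value;
}

/* Scalar stand-in for "zip1 vD.8h, vN.8h, vM.8h": interleave the low
   halves of the two sources, so out = { n[0], m[0], n[1], m[1], ... }. */
static void zip1_8h(int16_t out[LANES], const int16_t n[LANES],
                    const int16_t m[LANES])
{
    for (int i = 0; i < LANES / 2; i++) {
        out[2 * i] = n[i];
        out[2 * i + 1] = m[i];
    }
}

/* Scalar stand-in for "ins vD.h[lane], wN": overwrite a single lane. */
static void ins_h(int16_t v[LANES], int lane, int16_t value)
{
    assert(lane >= 0 && lane < LANES);
    v[lane] = value;
}

/* dup + dup + zip1: the sequence proposed for { x, y, x, y, x, y, x, y }. */
static void build_xy(int16_t out[LANES], int16_t x, int16_t y)
{
    int16_t vx[LANES], vy[LANES];
    dup_8h(vx, x);          /* dup v0.8h, w0 */
    dup_8h(vy, y);          /* dup v1.8h, w1 */
    zip1_8h(out, vx, vy);   /* zip1 v0.8h, v0.8h, v1.8h */
}

/* Same sequence plus one ins for { x, y, x, y, x, y, x, 0 }. */
static void build_xy_zero_last(int16_t out[LANES], int16_t x, int16_t y)
{
    build_xy(out, x, y);
    ins_h(out, LANES - 1, 0); /* ins v0.h[7], wzr */
}

/* Returns nonzero if v matches the expected lane values exactly. */
static int lanes_equal(const int16_t v[LANES], const int16_t expected[LANES])
{
    return memcmp(v, expected, sizeof(int16_t) * LANES) == 0;
}
```

Because zip1 only reads lanes 0..3 of each source, a broadcast in both sources is enough to fill all eight result lanes alternately with x and y. On an actual AArch64 target the same shape can be written with the ACLE intrinsics as vzip1q_s16 (vdupq_n_s16 (x), vdupq_n_s16 (y)); whether GCC emits exactly that sequence depends on the patch under discussion.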