[Neon intrinsics] Literal vector construction through vcombine is poor

Michael Collison Fri, 16 Jun 2017 14:09:06 -0700

This patch improves code generation for literal vector construction by
expanding the pattern and exposing it to the RTL optimizers earlier. The
current implementation delays splitting the pattern until after reload, which
results in poor code generation for the following code:



#include "arm_neon.h"

int16x8_t
foo ()
{
  return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
}

Trunk generates:

foo:
        movi    v1.2s, 0
        movi    v0.4h, 0x8
        dup     d2, v1.d[0]
        ins     v2.d[1], v0.d[0]
        orr     v0.16b, v2.16b, v2.16b
        ret

With the patch we now generate:

foo:
        movi    v1.4h, 0x8
        movi    v0.4s, 0
        ins     v0.d[1], v1.d[0]
        ret
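
For anyone checking the two sequences against each other, the value foo is
expected to return can be modelled in portable C without NEON. This is only a
sketch of the intrinsic semantics (first operand in the low 64 bits, second
operand in the high 64 bits); the model_* types and functions are hypothetical
names made up for illustration and are not part of arm_neon.h or of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the 64-bit and 128-bit NEON vector types. */
typedef struct { int16_t lane[4]; } model_int16x4_t;
typedef struct { int16_t lane[8]; } model_int16x8_t;

/* Models vdup_n_s16: broadcast one scalar into all four lanes. */
static model_int16x4_t
model_vdup_n_s16 (int16_t value)
{
  model_int16x4_t r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = value;
  return r;
}

/* Models vcombine_s16: LOW fills lanes 0-3, HIGH fills lanes 4-7.  */
static model_int16x8_t
model_vcombine_s16 (model_int16x4_t low, model_int16x4_t high)
{
  model_int16x8_t r;
  memcpy (&r.lane[0], low.lane, sizeof low.lane);
  memcpy (&r.lane[4], high.lane, sizeof high.lane);
  return r;
}

/* Same shape as foo () above, expressed with the model helpers.  */
static model_int16x8_t
model_foo (void)
{
  return model_vcombine_s16 (model_vdup_n_s16 (0), model_vdup_n_s16 (8));
}
```

In this model lanes 0-3 of the result are 0 and lanes 4-7 are 8, which matches
the patched sequence: the zero movi supplies the low half and the ins into
v0.d[1] supplies the high half.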

Bootstrapped and tested on aarch64-linux-gnu. Okay for trunk?

2017-06-15  Michael Collison  <michael.colli...@arm.com>

        * config/aarch64/aarch64-simd.md (aarch64_combine_internal<mode>):
        Convert from define_insn_and_split into define_expand.
        * config/aarch64/aarch64.c (aarch64_split_simd_combine):
        Allow register and subreg operands.

pr7057.patch
